/var/folders/mq/yzq45gh52xn7zz498z7cjtkw0000gn/T/ipykernel_83061/2124256990.py:34: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
  set_matplotlib_formats('svg')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [1], in <cell line: 47>()
     43 get_ipython().run_line_magic('matplotlib', 'inline')
     45 from IPython.display import display, HTML
---> 47 from matplotlib_venn import venn2, venn3
     48 import plotly.graph_objects as go
     49 import seaborn as sns

ModuleNotFoundError: No module named 'matplotlib_venn'

Single TF

$$\newcommand{\kon}{k_{\mathrm{on}}} \newcommand{\koff}{k_{\mathrm{off}}}$$

There exist two mathematical formalisms to describe the probabilities of molecular configurations at equilibrium: thermodynamics and kinetics. In the kinetics formalism, the system transits between configurational states according to rate parameters contained in differential equations, which can describe not just the system's equilibrium but also its trajectory towards it, or the kinetics of non-equilibrium systems. In the thermodynamics/statistical mechanics formalism, we posit that a system will occupy configurations with lower energy, and use the Boltzmann distribution to estimate the proportion of time the system spends in each state, at equilibrium. The thermodynamics formalism is limited to describing equilibrium state probabilities, but it does so with fewer parameters.

We'll derive an expression for the probability of a single TFBS' occupancy with both formalisms, but proceed with the thermodynamic description for more elaborate configurations. It will become clear why that is preferable.

Kinetics

Most derivations of the probability of single TFBS occupancy at equilibrium employ a kinetics formalism, so we'll walk through that first, and then explore the analog in the thermodynamics description. In the kinetics description, the parameters are rates.

$$ \mathrm{TF} + \mathrm{TFBS} \underset{\koff}{\overset{\kon}{\rightleftarrows}} \mathrm{TF\colon TFBS} $$

The natural rates are the rate of TF binding $\kon$ and unbinding $\koff$. Equilibrium is reached when binding and unbinding are balanced:

$$\frac{d[\mathrm{TF\colon TFBS}]}{dt} = k_{\mathrm{on}}[\mathrm{TF}][\mathrm{TFBS}] - k_{\mathrm{off}}[\mathrm{TF\colon TFBS}] = 0 \text{ at equilibrium}$$ $$k_{\mathrm{on}}[\mathrm{TF}]_{\mathrm{eq}}[\mathrm{TFBS}]_{\mathrm{eq}} = k_{\mathrm{off}}[\mathrm{TF\colon TFBS}]_{\mathrm{eq}}$$ $$\text{(dropping eq subscript) }[\mathrm{TF\colon TFBS}] = \frac{k_{\mathrm{on}}[\mathrm{TF}][\mathrm{TFBS}]}{k_{\mathrm{off}}} = \frac{[\mathrm{TF}][\mathrm{TFBS}]}{k_{d}}$$

where $k_{d} = \frac{\koff}{\kon}$ is called the dissociation constant (or equilibrium constant). We'd like to determine the probability of finding the TFBS occupied, i.e. the fraction of time it spends in the bound state. That fraction is $\frac{[\mathrm{bound}]}{([\mathrm{unbound}] + [\mathrm{bound}])} = \frac{[\mathrm{TF\colon TFBS}]}{([\mathrm{TFBS}] + [\mathrm{TF\colon TFBS}])}$. Define the denominator as $[\mathrm{TFBS}]_{0} = [\mathrm{TFBS}] + [\mathrm{TF\colon TFBS}]$ so that $[\mathrm{TFBS}] = [\mathrm{TFBS}]_{0} - [\mathrm{TF\colon TFBS}]$ and substitute:

$$[\mathrm{TF\colon TFBS}] = \frac{[\mathrm{TF}]([\mathrm{TFBS}]_{0} - [\mathrm{TF\colon TFBS}])}{k_{d}}$$ $$[\mathrm{TF\colon TFBS}](k_d + [\mathrm{TF}]) = [\mathrm{TF}][\mathrm{TFBS}]_{0}$$ $$\frac{[\mathrm{TF\colon TFBS}]}{[\mathrm{TFBS}]_{0}} = \frac{[\mathrm{TF}]}{k_d + [\mathrm{TF}]}$$

Note: We could also ask for this expression in terms of $[\mathrm{TF}]_0 = [\mathrm{TF}] + [\mathrm{TF\colon TFBS}]$ however, since we’re considering a single TFBS, $[\mathrm{TF\colon TFBS}]$ is at most 1, and so $[\mathrm{TF}]_0 \approx [\mathrm{TF}]$. In instances of ligand-receptor binding in which that approximation cannot be made, the fraction bound is a messy quadratic. Derivation here.

Thermodynamics

In the thermodynamics description, the parameters are Gibbs free energies $\Delta G$. Let's follow the derivation from Physical Biology of the Cell (pp. 242) and consider the number microstates underlying each of the of bound and unbound macrostates, and their energies.

In order to count microstates, we imagine distributing $L$ TF molecules across a space-filling lattice with $\Omega$ sites. The energy of a TF in solution is $\varepsilon_{\mathrm{solution}}$ and the energy of a bound TF is $\varepsilon_{\mathrm{bound}}$. $\beta$ is the constant $1/k_b T$ where $k_b$ is Boltzmann's constant and $T$ is the temperature.

<!--

StateEnergyMultiplicityWeight
$A \cdot A_s$ $\frac{\Omega!}{(\Omega - A)!A!} \approx \frac{\Omega^{A}}{A!}$ $\frac{\Omega^{A}}{A!} \cdot e^{-\beta \left[ A \cdot A_s \right]}$
$(A - 1) A_s + A_b$ $\frac{\Omega!}{(\Omega - (A - 1))!(A-1)!B!} \approx \frac{\Omega^{A-1}}{(A-1)!}$ $\frac{\Omega^{A-1}}{(A-1)!} \cdot e^{-\beta \left[ (A - 1) A_s + A_b \right]}$
-->

In our case, the number of microstates in the unbound macrostate is $\frac{\Omega !}{L!(\Omega -L)!}\approx \frac{\Omega^L}{L!}$ and they each have energy $L \cdot \varepsilon_s$. The number of microstates in the bound macrostate is $\frac{\Omega !}{(L-1)!(\Omega -(L+1))}\approx \frac{\Omega^{(L-1)}}{(L-1)!}$ and they each have energy $(L-1) \varepsilon_s + \varepsilon_b$.

The Boltzmann distribution describes the probability of a microstate as a function of its energy: $p(E_i) = e^{-\beta E_i}/Z$ where $Z$ is the "partition function" or simply $\sum_i e^{-\beta E_i}$ the sum of the weights of the microstates, which normalizes the distribution. In our case:

$$Z(L,\Omega)=\left(\colorbox{LightCyan}{$ \frac{\Omega^L}{L!} e^{-\beta L \varepsilon_s}$}\right) + \left(\colorbox{Seashell}{$\frac{\Omega^{(L-1)}}{(L-1)!} e^{-\beta [(L-1) \varepsilon_s + \varepsilon_b]}$}\right)$$

With that expression in hand, we can express the probability of the bound macrostate, $p_b$:

$$p_b=\frac{ \colorbox{Seashell}{$\frac{\Omega^{(L-1)}}{(L-1)!} e^{-\beta [(L-1) \varepsilon_s + \varepsilon_b]}$}}{\colorbox{LightCyan}{$\frac{\Omega^L}{L!} e^{-\beta L \varepsilon_s}$} + \colorbox{Seashell}{$\frac{\Omega^{(L-1)}}{(L-1)!} e^{-\beta [(L-1) \varepsilon_s + \varepsilon_b]}$}} \cdot \color{DarkRed}\frac{\frac{\Omega^L}{L!}e^{\beta L \varepsilon_s}}{\frac{\Omega^L}{L!}e^{\beta L \varepsilon_s}} \color{black} = \frac{(L/\Omega)e^{- \beta \Delta \varepsilon}}{1+(L/\Omega)e^{- \beta \Delta \varepsilon}} $$

Where we have defined $\Delta \varepsilon = \varepsilon_b - \varepsilon_s$. $L/\Omega$ is really just a dimensionless TF concentration, which we'll hand-wave as being equivalent to $[\mathrm{TF}]$, which leaves us with an expression suspiciously similar to the one we derived from the kinetics formalism:

$$p_b = \frac{[\mathrm{TF}]e^{-\beta \Delta \varepsilon}}{1+[\mathrm{TF}]e^{-\beta \Delta \varepsilon}} \cdot \color{DarkRed}\frac{e^{\beta \Delta \varepsilon}}{e^{\beta \Delta \varepsilon}} \color{black} = \frac{[\mathrm{TF}]}{e^{\beta \Delta \varepsilon}+[\mathrm{TF}]}$$

From which we recapitulate an important correspondence between kinetics and thermodynamics at equilibrium: $ k_d = e^{\beta \Delta \varepsilon} = e^{\Delta \varepsilon / k_bT} $ more commonly written for different units as $k = e^{-\Delta G / RT}$.

The takeaway is that both the kinetics and thermodynamics formalisms produce an equivalent expression for the probabilities of each of the bound and unbound configurations, parameterized respectively by $k_d$ and $\Delta G$.

Sample Values

In order to compute probabilities like $p_b$, we need concrete TF concentrations $[\mathrm{TF}]$ and binding affinities (either $k_d$ or $\Delta G$). What are typical intranuclear TF concentrations and binding affinities?

Concentrations

A typical human cell line, K562s, have a cellular diameter of 17 microns. (BioNumbers)

def sphere_volume(d):
    return 4/3*np.pi*(d/2)**3

K562_diameter_microns = 17
K562_volume_micron_cubed = sphere_volume(K562_diameter_microns)
print(f'K562 volume: {round(K562_volume_micron_cubed)} μm^3')
K562 volume: 2572 μm^3

A typical expressed TF has a per-cell copy number range from $10^3$ - $10^6$. (BioNumbers)

copy_number_range = [1e3, 1e6]
N_A = 6.02214076e23

def copy_number_and_cubic_micron_volume_to_molar_concentration(copy_number, volume=K562_volume_micron_cubed):
    return (copy_number / N_A) / (volume * (1e3 / 1e18))  # 1000 Liters / m^3; 1e18 μm^3 / m^3

lower_end_molar = copy_number_and_cubic_micron_volume_to_molar_concentration(copy_number_range[0], K562_volume_micron_cubed)
upper_end_molar = copy_number_and_cubic_micron_volume_to_molar_concentration(copy_number_range[1], K562_volume_micron_cubed)

lower_end_nanomolar = lower_end_molar / 1e-9
upper_end_nanomolar = upper_end_molar / 1e-9

print('If TF copy numbers range from 1,000-1,000,000, then TF concentrations range from', str(round(lower_end_nanomolar))+'nM', 'to', str(round(upper_end_nanomolar))+'nM')
If TF copy numbers range from 1,000-1,000,000, then TF concentrations range from 1nM to 646nM

We might also like a distribution over this range. Let's posit a lognormal, where $10^3$ and $10^6$ are the 3σ from the mean, which is $10^{4.5}$. Then $\sigma = 10^{0.5}$

Affinities

What are typical TF ΔG's of binding? How about the $\koff$ rates and half lives?

  • We can use the prior knowledge that dissociation constants should be in the nanomolar regime (BioNumbers).
  • We can use the relation that $\Delta G = -k_b T \cdot \ln(k_d)$ (Plugging in 310°K (human body temp) and the Boltzmann constant $k_b$ in kcal/Mol)
  • We use the approximation that $\kon$ is ~$10^5 / $ Molar $ \times $ sec (Wittrup)
T = 310
k_b = 3.297623483e-24 * 1e-3  ## cal/K * kcal/cal
kbT = k_b*T*N_A
kbT  ## in 1/Mol -- an unusual format

k_on = 1e5

def nanomolar_kd_from_kcal_ΔG(ΔG): return exp(-ΔG/kbT) * 1e9
def kcal_ΔG_from_nanomolar_kd(K_d): return -kbT*ln(K_d*1e-9)
def k_off_from_nanomolar_kd(k_d): return (k_d*1e-9) * k_on
def half_life_from_kd(k_d): return ln(2) / ((k_d*1e-9) * k_on)
$\Delta G$ $\kon$ $\koff$ $t_{1/2}$
$k_d$
1 12.757685 1e5 / (M * s) 0.0001 0 days 01:55:31.471805599
10 11.340164 1e5 / (M * s) 0.0010 0 days 00:11:33.147180560
100 9.922644 1e5 / (M * s) 0.0100 0 days 00:01:09.314718056
1000 8.505123 1e5 / (M * s) 0.1000 0 days 00:00:06.931471806

We learn that an order of magnitude residence time difference results from just 1.4 extra kcal/Mol, and that TF half lives range from about 5s to about 2h. Let's once again posit a distribution of affinities to sample from (defined on $k_d$):

With those concrete TF concentrations and dissociation constants, we can finally plot our function $p_b = \frac{[\mathrm{TF}]}{e^{\beta \Delta \varepsilon}+[\mathrm{TF}]}$.

@np.vectorize
def fraction_bound(TF, ΔG):
    '''TF in nanomolar'''
    return TF / (TF + nanomolar_kd_from_kcal_ΔG(ΔG))

(Note that both $[\mathrm{TF}]$ and $k_d$ are plotted in log space, but $p_b$ is linear.)

Multiple TFs: Direct Cooperativity & Competition

Suppose now that two TFs bind adjacent segments of DNA in such a way that the binding of either facilitates (or hinders) the binding of the other. We call this direct cooperativity (and competition). We'll focus first on cooperativity.

Cooperativity

As before, the statistical mechanics formalism entails enumerating the configurations, their multiplicities, and their energies. We'll call the TFs A and B. We'll denote their counts as $A$ and $B$. The energy of a TF in solution will once more be $A_s$ and bound to its cognate TFBS $A_b$. The energy of cooperativity will be $C_{AB}$.

StateEnergyMultiplicityWeight
$A \cdot A_s + B \cdot B_s$ $\frac{\Omega!}{(\Omega - A - B)!A!B!} \approx \frac{\Omega^{A+B}}{A!B!}$ $\frac{\Omega^{A+B}}{A!B!} \cdot e^{-\beta \left[ A \cdot A_s + B \cdot B_s \right]}$
$(A - 1) A_s + A_b + B \cdot B_s$ $\frac{\Omega!}{(\Omega - (A - 1) - B)!(A-1)!B!} \approx \frac{\Omega^{A+B-1}}{(A-1)!B!}$ $\frac{\Omega^{A+B-1}}{(A-1)!B!} \cdot e^{-\beta \left[ (A - 1) A_s + A_b + B \cdot B_s \right]}$
$A \cdot A_s + (B - 1) B_s + B_b$ $\frac{\Omega!}{(\Omega - A - (B - 1))!A!(B-1)!} \approx \frac{\Omega^{A+B-1}}{A!(B-1)!}$ $\frac{\Omega^{A+B-1}}{A!(B-1)!} \cdot e^{-\beta \left[ A \cdot A_s + (B - 1) B_s + B_b \right]}$
$(A - 1) A_s + A_b + (B - 1) B_s + B_b + C_{AB}$ $\frac{\Omega!}{(\Omega - (A - 1) - (B-1))!(A-1)!(B-1)!} \approx \frac{\Omega^{A+B-2}}{(A-1)!(B-1)!}$ $\frac{\Omega^{A+B-2}}{(A-1)!(B-1)!} \cdot e^{-\beta \left[ (A - 1) A_s + A_b + (B - 1) B_s + B_b + C_{AB} \right]}$

The partition function is the sum of the weights:

$$ Z = \frac{\Omega^{A+B}}{A!B!} \cdot e^{-\beta \left[ A \cdot A_s + B \cdot B_s \right]} + \frac{\Omega^{A+B-1}}{(A-1)!B!} \cdot e^{-\beta \left[ (A - 1) A_s + A_b + B \cdot B_s \right]} + \frac{\Omega^{A+B-1}}{A!(B-1)!} \cdot e^{-\beta \left[ A \cdot A_s + (B - 1) B_s + B_b \right]} + \frac{\Omega^{A+B-2}}{(A-1)!(B-1)!} \cdot e^{-\beta \left[ (A - 1) A_s + A_b + (B - 1) B_s + B_b + C_{AB} \right]}$$

Which we can greatly simplify by multiplying the entire expression by the reciprocal of the "base state" weight, $\color{DarkRed}\frac{A!B!}{\Omega^{A+B}} \cdot e^{\beta \left[ A \cdot A_s + B \cdot B_s \right]}$, normalizing that weight to 1:

$$ Z = 1 + \frac{A}{\Omega} \cdot e^{-\beta \left[ A_b-A_s \right]} + \frac{B}{\Omega} \cdot e^{-\beta \left[ B_b-B_s \right]} + \frac{A \cdot B}{\Omega^2} \cdot e^{-\beta \left[ A_b-A_s+B_b-B_s+C_{AB} \right]}$$

Taking the definition $[A] = A/\Omega$ and $\Delta G_A = A_b-A_s$ produces:

$$ Z = 1 + [A] e^{-\beta \left[ \Delta G_A \right]} + [B] e^{-\beta \left[ \Delta G_B \right]} + [A][B] e^{-\beta \left[ \Delta G_A+\Delta G_B+C_{AB} \right]}$$

Then the probability of any state is just the weight of that state (scaled by the weight of the base state) divided by the partition function expression $Z$.

From the above, we notice the form of the expression for the weight of a configuration of N TFBSs:

$$ p_{\mathrm{config}} = \prod_{i \in \, \mathrm{bound \,TBFS}} \left( [\mathrm{TF}_{\mathrm{cognate}(i)}] \cdot e^{-\beta \left[ \Delta G_i + \sum_j c_{ij} \right]} \right) / Z$$

For numerical stability, we take the log of the unnormalized probability (that is, the weight) of configurations:

$$ \log(\tilde{p}_{\mathrm{config}}) = \sum_{i \in \, \mathrm{bound \,TBFS}} \left( \log([\mathrm{TF}_{\mathrm{cognate}(i)}]) - \beta \left[ \Delta G_i + \sum_j c_{ij} \right] \right) $$

Competition

Incorporating competition into our thermodynamic model is slightly more subtle than cooperativity, because we can imagine two types of competition

  • Two TFs which may both bind DNA at adjacent sites, causing a free energy penalty due to some unfavorable interaction.
  • Two TFs with overlapping DNA binding sites, which cannot physically be bound at the same time.

In the former case, the expression for $p_\mathrm{config}$ we had written before suffices, with values of $C_{AB}$ having both signs to represent cooperativity and competition. Nominally, the latter case also fits this formalism if we allow $C_{AB}$ to reach $-\infty$, but that would cause us headaches in the implementation. Instead, the weights of all those configurations which are not physical attainable, due to "strict" competitive binding between TFs vying for overlapping binding sites, are merely omitted from the denominator $Z$.

Sample cooperativity values

In order to compute concrete probabilities of configurations, accounting for cooperativity and competition, we will need concrete energies. We'll take $C_{AB}$ to be distributed exponentially with a mean at 2.2kcal/Mol. (Forsén & Linse).

Let's check our derivation (and our implementation) of $p_\mathrm{config}$ by comparing it to our direct computation of $p_b$ from §1. Single TF.

def enumerate_configs(TFBSs): return list(map(np.array, itertools.product([0,1], repeat=len(TFBSs))))

def p_configs(TFBSs, TF_conc, cooperativities):

    configs = enumerate_configs(TFBSs)
    weights = []

    for config in configs:

        weights.append(np.exp(log_P_config(config, TFBSs=TFBSs, TF_conc=TF_conc, cooperativities=cooperativities)))

    return list(zip(configs, np.array(weights) / sum(weights)))
p_bound (prev method): 	 0.00517810045101498
p_config (new method): 	 [(array([0]), 0.9948218995489851), (array([1]), 0.005178100451014972)]

With that sanity check, let's now consider the scenario of two transcription factors with cooperative binding, and compute the probabilities of each of the 4 configurations:

Low-Affinity binding

In reality, transcription factors' binding sites are not so discretely present or absent: transcription factors may bind anywhere along the DNA polymer, with an affinity dependent on the interaction surface provided by the sequence of nucleotides at each locus. There are a variety of approaches to model the sequence-to-affinity function for each TF. The simplest is to consider each nucleotide independently, and list the preferences for each nucleotide at each sequential position in a matrix. This type of "mono-nucleotide position weight matrix" (PWM) model is commonly used, and frequently represented by a "sequence logo" plot.

PWMs: Construction

I have curated PWMs from various databases for ~1000 human TFs.

tfdb2 = pd.read_json('../../AChroMap/data/processed/TF.2022.json').fillna(value=np.nan)  ## prefer nan to None

cisbp_pwms = pd.read_pickle('../../AChroMap/data/raw/TF/CISBP_2_PWMs.pickle')
humanTFs_pwms = pd.read_pickle('../../AChroMap/data/raw/TF/HumanTFs_PWMs.pickle')
jaspar2022_pwms = pd.read_pickle('../../AChroMap/data/raw/TF/jaspar2022_PWMs.pickle')
probound_pwms = pd.read_json('../../AChroMap/data/processed/TF_ProBound.json', orient='records').T  # we only get probound as delta G scores, not as PWMs

As an example, consider the Transcription Factor SPI1 / PU.1, for which the following PWMs are registered in the CisBP 2.00 database:

SPI1_pwms = tfdb2.loc[(tfdb2.Name == 'SPI1')]

SPI1_cisbp_pwms = cisbp_pwms.loc[SPI1_pwms.CISBP_2_strict.iat[0]]
SPI1_humanTFs_pwms = humanTFs_pwms.reindex(SPI1_pwms.HumanTFs.iat[0]).dropna()
SPI1_jaspar2022_pwms = jaspar2022_pwms.loc[SPI1_pwms.JASPAR_2022.iat[0]]
SPI1_probound_pwms = probound_pwms.loc[SPI1_pwms.ProBound.iat[0]].pwm
for (name, SPI1_pwm), offset in zip(SPI1_cisbp_pwms.items(), (0,0,2,0,0,0)):
    ax = plot_pwm(SPI1_pwm, figsize=((len(SPI1_pwm)+offset+1)*0.25,1), xtick_offset=offset)
    ax.set_title(name, loc='left')

This representation of this matrix visualizes the proportion of occurrences of a base at a particular position in aligned SPI1 binding sites. Since they look similar, we may be tempted to average them to have a single model of PU.1 binding, however it is considered unwise to do so, as TFs are believed to have multiple binding "modes" which refers explicitly to modes in the distribution of sequence-binding affinities.

These PWM models can be fit from a variety of experiments measuring PU.1 binding. In this case, we're looking at models fit from High-Throughput SELEX data, but other models for PU.1 binding generated from other types of experiments are catalogued at its entry in the HumanTFs database.

PWMs: scoring sequences

If we assume a TF's PWM catalogues its relative binding preferences to every possible sequence of individual bases, we can use that PWM to evaluate the relative probabilities of TF binding to new sequences. We'll define the PWM "score" as the (log) ratio of the probability of the sequence with the PWM against random chance. The probability of a sequence under the PWM is $\prod \mathrm{PWM}[i][\mathrm{dna}[i]]$. The probability of the sequence under a background model is $\prod \mathrm{bg}(\mathrm{dna}[i])$ where $\mathrm{bg}$ describes the frequency of each base in the genome (not quite 1/4 in the human genome). We take the logarithm of the ratio of these two probabilities to produce the log-likelihood ratio, which we call our "score":

$$\mathrm{score}(\mathrm{PWM}, j) = \log(\frac{\prod \mathrm{PWM}[i][\mathrm{dna}[j+i]]}{\prod \mathrm{bg}(\mathrm{dna}[j+i])}) = \sum \log(\frac{ \mathrm{PWM}[i][\mathrm{dna}[j+i]]}{\mathrm{bg}(\mathrm{dna}[j+i])}) $$

A positive $\mathrm{score}(\mathrm{PWM}, j)$ indicates the subsequence starting at position $j$ is likelier under the PWM model than a random sequence from the human genome, and a negative score indicates the subsequence is more likely to be random. Let's visualize a PU.1 scoring matrix:

bg = [0.2955, 0.2045, 0.2045, 0.2955]
pseudocount = 0.0001

def log_odds_pwm(pwm):
    return np.log((pwm+pseudocount)/bg)
pwm = SPI1_cisbp_pwms.iloc[1]
pwm_name = SPI1_cisbp_pwms.index[0]
ax = plot_pwm(log_odds_pwm(pwm), figsize=(4,4))
_ = ax.set_ylabel('score',  weight='bold')
cisbp_score_pwms = cisbp_pwms.apply(log_odds_pwm)
humanTFs_score_pwms = humanTFs_pwms.apply(log_odds_pwm)
jaspar2022_score_pwms = jaspar2022_pwms.apply(log_odds_pwm)

SPI1_cisbp_score_pwms = cisbp_score_pwms.loc[SPI1_pwms.CISBP_2_strict.iat[0]]
SPI1_humanTFs_score_pwms = humanTFs_score_pwms.reindex(SPI1_pwms.HumanTFs.iat[0]).dropna()
SPI1_jaspar2022_score_pwms = jaspar2022_score_pwms.loc[SPI1_pwms.JASPAR_2022.iat[0]]

Our strategy will be to score genomic regions for TF binding, searching for high PWM scores starting at every offset $j$ in the supplied DNA sequence. Let's remind ourselves here that in order to find the full set of motif matches in a supplied DNA sequence, we either need to scan the reference genome and its reverse complement, or scan the positive strand with both the forward and reverse-complement PWMs. We opt for the latter approach, and therefore augment our PWM set with reverse-complement PWMs:

plot_pwm(pwm, figsize=(4,1)).set_title('Forward: '+pwm_name, loc='left')
plot_pwm(reverse_complement_pwm(pwm), figsize=(4,1)).set_title('Reverse: '+pwm_name, loc='left')
None

Since we now have sequence-dependent transcription factor binding models, we will select a sequence which harbors known binding sites for select transcription factors, and see whether the models predict binding of those transcription factors.

As a specific sequence, let's use the SPI1 promoter.

def get_reference_transcript(gene_name):

    hg38_reference_annotation = pd.read_csv('../../AChroMap/data/processed/GRCh38_V29.csv')
    reference_transcript = hg38_reference_annotation[(hg38_reference_annotation.gene_name == gene_name) & (hg38_reference_annotation.canonicity == 0)]
    chrom = reference_transcript.chr.iat[0]
    strand = reference_transcript.strand.iat[0]
    tss = (reference_transcript.txStart if all(reference_transcript.strand == 1) else reference_transcript.txEnd).iat[0]

    return chrom, strand, tss
gene_name = 'SPI1'

chrom, strand, tss = get_reference_transcript(gene_name)
start = tss-400
stop = tss+400
def get_hg38_bases_local(chrom, start, stop):
    seqkit_command = f'seqkit subseq /Users/alex/Documents/AChroMap/data/raw/genome/hg38_fa/{chrom}.fa --region {start}:{stop}'
    dna = subprocess.run(shlex.split(seqkit_command), stdout=subprocess.PIPE).stdout.decode('utf-8').split('\n',1)[1].replace('\n','')
    return dna
SPI1_promoter_dna_sequence = get_hg38_bases_local(chrom, start, stop)
print(SPI1_promoter_dna_sequence)
TTCCAGGGAGGAAACCCTGACTTCCCACTGATAGCAAGCCAGGAGGGCAGTGGGTGGGCTGGCGGGTTCGTGGGCAGGCAGGCAGGCGTCCGAGGGCCACGGGTTGGGCTGGTGGAGGAGTCCCGGTACTCACAGGGGGGACGAGGGGAAACCCTTCCATTTTGCACGCCTGTAACATCCAGCCGGGCTCCGAGTCGGTCAGATCCCCTGCCTCGGTGGGGGCCAATGCAGAGCCCCTCAGGATGGGGTGCCCCGTCAGGGGCTGGACGGTCGTGGGGCGGGTGCAGGGCTCAGGCCTGCCCCCTGAGCTACAGGAGCCCTGGGTGAGCCCCCTCCCTTGACATTGCAGGGCCAGCACAAGTTCCTGATTTTATCGAAGGGCCTGCCGCTGGGAGATAGTCCCCTTGGGGTGACATCACCGCCCCAACCCGTTTGCATAAATCTCTTGCGCTACATACAGGAAGTCTCTGGCCGGCTGGGGCAGGTGGTGCTCAAAGGGCTGGCCTGGGAAGCCATGGGGTCCAGGCCCCCTGCCCAGAGGAAGCTGGGACTGAGAGGGATGACTTTGGGGGCTAAGCTGGGGAGGGAGGATGGGAGGGAGAACGTGTAGCTCTGCCACACCACTGGGAGGCTTTTGCTCTAACCCAACAAATGCCTGCTTCTTTTGAGATCCCTATGTAGCCAACAGTCACCTCATTGGGGTCAGAGCTGGAAGGGGTGGCCTCTTGGGGCCTCCACTTTCTGGAGTCAGCCTTCCTGGGTGAGGGCTCTGATCTAGCAGGCTATCAGGCCTGGCTCTTC

SPI1 promotes its own transcription, by binding to its own promoter. We can scan the SPI1 promoter for significant SPI1 PWM scores to locate the binding site.

base_index = {'a':0,'c':1,'g':2,'t':3,'A':0,'C':1,'G':2,'T':3}

def scan_sequence(pwm, dna):
    scores = []

    for i in range(len(dna)):
        score = 0
        for j, row in enumerate(pwm):
            if i+j < len(dna):
                score += row[base_index[dna[i+j]]]

        scores.append(score)

    return np.array(scores)
SPI1_PWM_scores = pd.DataFrame(SPI1_cisbp_score_pwms.apply(scan_sequence, args=(SPI1_promoter_dna_sequence,)).values.tolist(), index=SPI1_cisbp_score_pwms.index).T
SPI1_PWM_scores.index += start

positive_SPI1_PWM_scores = pd.DataFrame(np.where(SPI1_PWM_scores > 0, SPI1_PWM_scores, np.nan), index=SPI1_PWM_scores.index, columns=SPI1_PWM_scores.columns)
def genomic_xticks(ax):
    xlims = ax.get_xlim()
    xtick_vals = ax.get_xticks()
    ax.set_xticks(xtick_vals)
    ax.set_xticklabels(['{:.0f}'.format(x) for x in xtick_vals])
    ax.set_xlim(xlims)

ax = SPI1_PWM_scores.plot(lw=0.1, xlim=(SPI1_PWM_scores.index.min(),SPI1_PWM_scores.index.max()), legend=False)
ax = positive_SPI1_PWM_scores.plot(style=".", ax=ax)
ax.set_ylabel('score')
genomic_xticks(ax)

def plot_tss(ax, tss, strand, axvline=False):
    arrow_direction = 8 if strand == -1 else 9
    ax.plot(tss, 1, lw=0, markersize=10, c='deepskyblue', marker=arrow_direction)
    ax.text(tss+1,0,gene_name)
    if axvline: ax.axvline(tss, c='deepskyblue', lw=0.5)

plot_tss(ax, tss, strand)

Now we'd like to score every PWM against every offset in the entire reference human genome. In order to accomplish that, we'll use MOODS, which implements a faster scanning algorithm in C++. We'll use this script to score the human reference genome with our PWMs.

PWMs: Score cutoffs

Once we have scanned a sequence of genomic DNA, scoring the PWM at every offset, we need to establish which of those scores is "significant" enough that we would predict the existence of a transcription factor binding site at that position in the genome. We can imagine two approaches to distinguish "significant" motif matches for each PWM. Thresholding on a particular score corresponds to predicting the TF is bound to sites which are $e^{\mathrm{score}}$ more likely under the PWM than random. A score cutoff of 6.91 is 1000x more likely under the PWM than random, for every PWM.

Contrarily, if you wanted to choose the 0.1% top scores for each PWM, each PWM would have a unique score cutoff, as each cutoff would depend on the distribution of scores specified by the PWM. At first this may seem surprising, so we'll explore it empirically for two PWMs:

pwm_names = ['M07944_2.00','M03355_2.00']
pwms = cisbp_pwms.loc[pwm_names]
score_pwms = cisbp_score_pwms.loc[pwm_names]
for name, pwm in pwms.items(): plot_pwm(pwm, figsize=(4,1)).set_title(name, loc='left')
def sample_scores_from_pwm(pwm, sample_size=int(1e6)):
    scores = np.zeros(sample_size)
    for i in range(len(pwm)):
        scores += np.random.choice(pwm[i], sample_size, p=bg)
    return pd.Series(scores)
score_samples = score_pwms.apply(sample_scores_from_pwm).T
ax = score_samples.plot.hist(bins=100, alpha=0.5)
ax.set(xlabel='score')
fig_style_2(ax)

Clearly we can see these two PWMs define different score distributions for random sequences. Let's see what probabilities the highest scores correspond to for each of these PWMs, using MOODS' score cutoff method:

import MOODS.tools

def score_thresholds(score_pwm):
    min_significance, max_significance = -2, -15
    significance_range = min_significance - max_significance
    pvals = np.logspace(min_significance, max_significance, significance_range*2+1)
    score_thresholds = [MOODS.tools.threshold_from_p(score_pwm.T.tolist(), bg, p) for p in pvals]
    return pd.Series(score_thresholds, index=pvals)
score_quantiles = score_pwms.apply(score_thresholds).T.rename_axis('p').reset_index()

fig, ax = plt.subplots()

cols = score_quantiles.columns[1:]
for col in cols:
    score_quantiles.plot(x=col, y='p', logy=True, legend=False, ax=ax)

ax.set(ylabel='p_value', xlabel='score')
ax.grid(True, ls=':')
for side in ["right","top"]: ax.spines[side].set_visible(False)
ax.legend(cols)
ax.axhline(1e-3, c='red')
ax.text(0,2e-3, 'p=0.001')
ax.plot(score_quantiles.iloc[2][['M07944_2.00','M03355_2.00']],(1e-3, 1e-3), marker='o', lw=0, c='red')
None
# cisbp_score_quantiles.to_csv('../data/equilibrium_TFs/cisbp_score_quantiles.csv', index=False)
cisbp_score_quantiles = pd.read_csv('../data/equilibrium_TFs/cisbp_score_quantiles.csv')
s = interpolate.InterpolatedUnivariateSpline(x, y)
----------------------------------------------------------------------
NameError                            Traceback (most recent call last)
<ipython-input-43-553df98bb994> in <module>
----> 1 s = interpolate.InterpolatedUnivariateSpline(x, y)

NameError: name 'x' is not defined

PWMs: Other formalisms

 

Unfortunately, the PWM formalism we have introduced misses the biophysical interpretation we have carried throughout our analysis thus far. We can coax a biophysical interpretation from these PWMs as follows (following the approach from Foat, Morozov, Bussemaker, 2006):

If we consider the nucleotides independently, then the ratio of equilibrium binding to a given nucleotide $j$ versus the highest-affinity nucleotide $\mathrm{best}$ at each position is related to the difference in energies of binding:

$$ \frac{[TF:N_j]}{[TF:N_{best}]} = \frac{K_{a_j}}{K_{a_{best}}} = \exp \{\Delta G_j - \Delta G_{best} \} $$

With this expression, we can compute the energy of binding of a TF to a sequence relative to that TF's energy of binding to its best sequence of nucleotides. However, we need to know an absolute energy of binding to anchor this relative measure.

tfdb['energy_pwm'] = tfdb.pwm.apply(make_energy_pwm)
----------------------------------------------------------------------
NameError                            Traceback (most recent call last)
<ipython-input-47-f68e5373e483> in <module>
----> 1 tfdb['energy_pwm'] = tfdb.pwm.apply(make_energy_pwm)

NameError: name 'tfdb' is not defined

We can visualize this "energy PWM" as well. Visualizing energies of particular bases relative to SPI1's best sequence does not produce a visual representation that's easy to reason about:

ax = plot_energy_pwm(tfdb.loc['SPI1'].energy_pwm)
----------------------------------------------------------------------
NameError                            Traceback (most recent call last)
<ipython-input-48-3abec1402bd4> in <module>
----> 1 ax = plot_energy_pwm(tfdb.loc['SPI1'].energy_pwm)

NameError: name 'plot_energy_pwm' is not defined

We can rescale this matrix to the average sequence:

SPI1_energy_pwm = tfdb.loc['SPI1'].energy_pwm

ax = plot_energy_pwm(SPI1_energy_pwm - np.expand_dims(SPI1_energy_pwm.mean(axis=1), axis=1))

However, the ideal rescaling of this PWM would indicate whether each nucleotide contributes positively or negatively towards the energy of binding. After all, SPI1 likely doesn't bind the "average" sequence -- which is to say it's off rate is higher than it's on rate.

We chose SPI1 / PU.1 as our prototype in part because its absolute binding energies to various sequences have been measured, allowing us to anchor our relative energies PWM model. Pham et al measured PU.1 binding to a variety of high-affinity sequences by microscale thermophoresis and recorded a $k_d$ of 156nm for the best one.

SPI1_best_nanomolar_kd = 156
SPI1_best_ΔG = kcal_ΔG_from_nanomolar_kd(SPI1_best_nanomolar_kd)
SPI1_best_ΔG

However, the question remains of how to allocate those ~9.6 kcal/mol across the 14 bases of the SPI1 motif. The naive option is to uniformly spread the binding energy across the bases (9.6/14). Instead, we'll distribute it proportionally to the variance at each position to capture the intuition that the flanking nucleotides contribute less.

SPI1_energy_pwm_scaled = make_abs_pwm(SPI1_energy_pwm, SPI1_best_ΔG)
plot_energy_pwm(SPI1_energy_pwm_scaled)

At last we have a visual representation of of a mono-nucleotide sequence preference model which captures our intuitions about the energetics of TF binding.

For a restricted set of TFs with high-quality HT-SELEX data, Rube et al produced relative energy PWMs with higher accuracy in the low-affinity regime. SPI1 is among that privileged set, so we'll take a look at that PWM for comparison.

ProBound_PWMs = pd.read_json('../../AChroMap/data/processed/TF_ProBound.json', orient='index')

probound_SPI1_pwms_scaled = ProBound_PWMs[ProBound_PWMs.TF_name == 'SPI1'].pwm.apply(make_abs_pwm, args=(SPI1_best_ΔG,))

ax = plot_energy_pwm(probound_SPI1_pwms_scaled.iloc[1])
tfdb['abs_pwm'] = tfdb.energy_pwm.apply(make_abs_pwm, args=(10,))
probound_SPI1_pwm_scaled = probound_SPI1_pwms_scaled.iloc[1]

Describe "The Experiment"

TFs_with_motifs = tfdb2[tfdb2.set_index('Name')[['HumanTFs','CISBP_2_strict','CISBP_2_intermediate','CISBP_2_tolerant','JASPAR_2022','ProBound']].notnull().any(axis=1).values].Name

venn3((set(tfdb2.Name), set(TFs_with_motifs), set(all_chip_able_targets)), ('DNA-binding TF', 'DNA-binding TF with motif', 'ChIP in K562'))
----------------------------------------------------------------------
NameError                            Traceback (most recent call last)
<ipython-input-49-d1fa4dd39727> in <module>
      1 TFs_with_motifs = tfdb2[tfdb2.set_index('Name')[['HumanTFs','CISBP_2_strict','CISBP_2_intermediate','CISBP_2_tolerant','JASPAR_2022','ProBound']].notnull().any(axis=1).values].Name
      2 
----> 3 venn3((set(tfdb2.Name), set(TFs_with_motifs), set(all_chip_able_targets)), ('DNA-binding TF', 'DNA-binding TF with motif', 'ChIP in K562'))

NameError: name 'all_chip_able_targets' is not defined
 
cisbp_strict_tfbs = process_MOODS_scan_results(cisbp_motif_matches, chrom, start, stop)
----------------------------------------------------------------------
NameError                            Traceback (most recent call last)
<ipython-input-51-27b4ba67253b> in <module>
----> 1 cisbp_strict_tfbs = process_MOODS_scan_results(cisbp_motif_matches, chrom, start, stop)

NameError: name 'process_MOODS_scan_results' is not defined
TFs_covered_by_CISBP_strict = tfdb2[tfdb2.CISBP_2_strict.notnull()].Name
venn2((set(all_chip_able_targets), set(TFs_covered_by_CISBP_strict)), ('all_chip_able_targets', 'TFs_covered_by_CISBP_strict'))
----------------------------------------------------------------------
NameError                            Traceback (most recent call last)
<ipython-input-52-0353085c93f2> in <module>
      1 TFs_covered_by_CISBP_strict = tfdb2[tfdb2.CISBP_2_strict.notnull()].Name
----> 2 venn2((set(all_chip_able_targets), set(TFs_covered_by_CISBP_strict)), ('all_chip_able_targets', 'TFs_covered_by_CISBP_strict'))

NameError: name 'all_chip_able_targets' is not defined
ChIPable_CISBPable_TFs = set(all_chip_able_targets) & set(TFs_covered_by_CISBP_strict)

Get ATAC peaks / promoter/enchaner regions

narrowpeak_cols = ['chrom','chromStart','chromEnd','name','score','strand','signalValue','pValue','qValue','peak']

ATAC_peaks_files = ['/Users/alex/Documents/AChroMap/data/raw/ENCODE/k562_atac/downloads/ENCFF333TAT.bed',
                    '/Users/alex/Documents/AChroMap/data/raw/ENCODE/k562_atac/downloads/ENCFF558BLC.bed']
def filter_overlapping_peaks_for_max_signalValue(filepath, fraction_either=0.8):

    narrowpeaks_unfiltered = pd.read_csv(f'{filepath}', sep='\t', na_values=['NAN'], header=None, names=narrowpeak_cols)

    bedops_command = f"bedops -u {filepath} | cut -f1-4,7 | bedmap --fraction-either {fraction_either} --max-element - | sort-bed --unique - > {filepath}.tmp"
    os.system(bedops_command)
    narrowPeaks_filtered_peak_names = pd.read_csv(f'{filepath}.tmp', sep='\t', na_values=['NAN'], header=None, names=narrowpeak_cols[:5]).name
    os.remove(f'{filepath}.tmp')

    return narrowpeaks_unfiltered[narrowpeaks_unfiltered.name.isin(narrowPeaks_filtered_peak_names)]
K562_ATAC_peaks_1_filtered = filter_overlapping_peaks_for_max_signalValue(ATAC_peaks_files[0])
K562_ATAC_peaks_2_filtered = filter_overlapping_peaks_for_max_signalValue(ATAC_peaks_files[1])

K562_ATAC_peaks_1_filtered['signalValue_percentile'] = K562_ATAC_peaks_1_filtered.signalValue.rank(pct = True)
K562_ATAC_peaks_2_filtered['signalValue_percentile'] = K562_ATAC_peaks_2_filtered.signalValue.rank(pct = True)
def get_overlapping_peaks(peaks_bed, chrom, start, stop):

    intervals = pd.IntervalIndex.from_arrays(peaks_bed.chromStart, peaks_bed.chromEnd)

    return peaks_bed[(peaks_bed.chrom == chrom) & intervals.overlaps(pd.Interval(start, stop))]
atac_peak_at_SPI1_promoter_1 = get_overlapping_peaks(K562_ATAC_peaks_1_filtered, chrom, start, stop)
atac_peak_at_SPI1_promoter_2 = get_overlapping_peaks(K562_ATAC_peaks_2_filtered, chrom, start, stop)

explain motivation: want to look in all ATAC regions for PWMs and chip signal in k562s

from dna_features_viewer import GraphicFeature, GraphicRecord

earliest_start = min(start, atac_peak_at_SPI1_promoter_1.chromStart.iat[0], atac_peak_at_SPI1_promoter_2.chromStart.iat[0])
latest_end = max(stop, atac_peak_at_SPI1_promoter_1.chromEnd.iat[0], atac_peak_at_SPI1_promoter_2.chromEnd.iat[0])

plot_dna_length = (latest_end - earliest_start)*1.1
offset = earliest_start - plot_dna_length*0.05

features = [
    GraphicFeature(start=start, end=stop, color="#ffd700", label="Designated as promoter"),
    GraphicFeature(start=tss-30, end=tss, strand=-1, color="#ffcccc", label="tss"),
    GraphicFeature(start=atac_peak_at_SPI1_promoter_1.chromStart.iat[0], end=atac_peak_at_SPI1_promoter_1.chromEnd.iat[0], color="#cffccc", label="atac_peak_at_SPI1_promoter_1"),
    GraphicFeature(start=atac_peak_at_SPI1_promoter_2.chromStart.iat[0], end=atac_peak_at_SPI1_promoter_2.chromEnd.iat[0], color="#ccccff", label="atac_peak_at_SPI1_promoter_2")
]
record = GraphicRecord(features=features, sequence_length=latest_end*1.2)
cropped_record = record.crop((offset, offset+plot_dna_length))

ax, rest = cropped_record.plot(figure_width=5)
plt.setp(ax.get_xticklabels(), fontsize=7)
None
ATAC_bigwig_1 = get_bigwig_values('../../AChroMap/data/raw/ENCODE/k562_atac/downloads/ENCFF357GNC.bigWig', chrom, start, stop)
ATAC_bigwig_2 = get_bigwig_values('../../AChroMap/data/raw/ENCODE/k562_atac/downloads/ENCFF600FDO.bigWig', chrom, start, stop)
----------------------------------------------------------------------
NameError                            Traceback (most recent call last)
<ipython-input-59-952f2f7e4b8b> in <module>
----> 1 ATAC_bigwig_1 = get_bigwig_values('../../AChroMap/data/raw/ENCODE/k562_atac/downloads/ENCFF357GNC.bigWig', chrom, start, stop)
      2 ATAC_bigwig_2 = get_bigwig_values('../../AChroMap/data/raw/ENCODE/k562_atac/downloads/ENCFF600FDO.bigWig', chrom, start, stop)

NameError: name 'get_bigwig_values' is not defined
ATAC_bigwig = pd.DataFrame([ATAC_bigwig_1, ATAC_bigwig_2], columns=range(start, stop), index=['ATAC_1', 'ATAC_2']).T
ATAC_bigwig
import pyBigWig

out = []

def get_bigbed_values(url, chrom, start, stop):
    with pyBigWig.open(url) as bw:
        for chrom, start, stop in intervals:
            out.append(bw.entries(chrom, start, stop))

    return out
K562_ATAC_peaks_1_filtered[K562_ATAC_peaks_1_filtered.score == 1000][['chrom','chromStart','chromEnd']].to_csv('~/Desktop/K562_ATAC_peaks_1_filtered_1000_intervals.csv', header=False, index=False)
K562_ATAC_peaks_2_filtered[K562_ATAC_peaks_2_filtered.score == 1000][['chrom','chromStart','chromEnd']].to_csv('~/Desktop/K562_ATAC_peaks_2_filtered_1000_intervals.csv', header=False, index=False)

Get all data about a set of regions

intervals = [(chrom, start, stop)]
intervals
[('chr11', 47378176, 47378976)]
SPI1_promoter_TF_ChIP_signal = pd.read_csv('../data/equilibrium_TFs/SPI1_promoter_TF_ChIP_signal.csv', index_col=0)
# tfbs
# dgf
# atac signal
atac_peak_at_SPI1_promoter_1
chrom chromStart chromEnd name score strand signalValue pValue qValue peak signalValue_percentile
47943 chr11 47376625 47379060 Peak_11701 1000 . 8.32616 1151.26013 1148.02783 1947 0.774715
import pickle

with open('../../AChroMap/scripts/cisbp2_vs_chip.pickle') as pickle_file:
    cisbp2_vs_chip = pickle.load(pickle_file)

os.system("printf '\a'") # or '\7'
----------------------------------------------------------------------
UnicodeDecodeError                   Traceback (most recent call last)
<ipython-input-66-89fc4cb1637e> in <module>
      2 
      3 with open('../../AChroMap/scripts/cisbp2_vs_chip.pickle') as pickle_file:
----> 4     cisbp2_vs_chip = pickle.load(pickle_file)
      5 
      6 os.system("printf '\a'") # or '\7'

/usr/local/Cellar/python@3.7/3.7.10_3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Load K562 RNA

ENCODE_RNA_metadata = pd.read_csv('../../AChroMap/data/raw/ENCODE/rna/metadata.clean.typed.csv')
K562_RNA_files = ENCODE_RNA_metadata[(ENCODE_RNA_metadata['Biosample term name'] == 'K562') & (ENCODE_RNA_metadata.type == 'RSEM_gene')].file.values
K562_RNA = pd.read_csv('../../AChroMap/data/raw/ENCODE/rna/RSEM_gene/RSEM_gene.counts_matrix.wellexpressed_deseq.csv', index_col=0)[K562_RNA_files]
K562_RNA = K562_RNA[K562_RNA.index.str.startswith('ENSG')]
K562_RNA = K562_RNA / K562_RNA.sum() * 360000
 
gene_id_to_name = pd.read_csv('../../AChroMap/data/processed/GRCh38_V29.csv')[['gene_name','gene_id']].drop_duplicates().set_index('gene_id')
gene_id_to_name = gene_id_to_name.reset_index().drop_duplicates(subset=['gene_name']).set_index('gene_name')
tf_gene_ids = gene_id_to_name.reindex(tfdb.index)
outdated_TF_names = {
    'ZNF875': 'HKR1',
    'ZFTA': 'C11orf95',
    'ZNF705EP': 'ZNF705E',
    'CENPBD1P': 'CENPBD1',
    # 'DUX1': '',
    # 'DUX3': '',
    # 'ZZZ3': ''
}
for current_name, old_name in outdated_TF_names.items():
    tf_gene_ids.loc[current_name, 'gene_id'] = gene_id_to_name.at[old_name, 'gene_id']

tf_gene_ids = tf_gene_ids.dropna()
K562_TF_RNA = K562_RNA.loc[tf_gene_ids.gene_id]
K562_TF_RNA.index = tf_gene_ids.index
K562_TF_RNA.to_csv('../data/equilibrium_TFs/K562_TF_RNA.csv')
K562_TF_RNA = K562_TF_RNA.replace(0, np.nan).median(axis=1).replace(np.nan, 0)
RNA_to_protein_linear_scaling_factor = 9800
K562_TF_copy_number = K562_TF_RNA * RNA_to_protein_linear_scaling_factor
K562_TF_copy_number.plot.hist(bins=100, logy=True)
<AxesSubplot:ylabel='Frequency'>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:28:56.889260 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
K562_TF_nanomolar_conc = K562_TF_copy_number.apply(copy_number_and_cubic_micron_volume_to_molar_concentration, args=(K562_volume_micron_cubed,)) * 1e9
K562_TF_nanomolar_conc = K562_TF_nanomolar_conc.rename('nanomolar_conc')

Get DGF

motif_cluster_and_gene = pd.read_csv('/Users/alex/Documents/AChroMap/data/raw/Vierstra_supp/motif_cluster_and_gene.csv', index_col=0)
motif_clusters_to_TFs = pd.read_csv('/Users/alex/Documents/AChroMap/data/raw/Vierstra_supp/motif_clusters_to_TFs.csv', index_col=0)
motif_clusters_to_TFs = motif_clusters_to_TFs.set_index('Name')['gene'].str.split(';')

DGF_footprint_index = pd.read_csv('/Users/alex/Documents/AChroMap/data/raw/Vierstra_supp/consensus_index_matrix_full_hg38.index.csv', index_col=0, header=None).index
DGF_samples_index = pd.read_csv('/Users/alex/Documents/AChroMap/data/raw/Vierstra_supp/consensus_index_matrix_full_hg38.columns.csv', index_col=0, header=None).index
K562_DGF_samples = DGF_samples_index[DGF_samples_index.str.startswith('K562')].tolist()
def get_k562_DGF(chrom, start, end):

    dgf_path = '/Users/alex/Documents/AChroMap/data/raw/Vierstra_supp/consensus_footprints_and_collapsed_motifs_hg38.bed.gz'
    tabix_coords = chrom+':'+str(start)+'-'+str(end)
    cols = ['contig','start','stop','identifier','mean_signal','num_samples','num_fps','width','summit_pos','core_start','core_end','motif_clusters']
    tabix_command = f'tabix {dgf_path} {tabix_coords}'
    footprints_contained_in_region = read_shell(shlex.split(tabix_command), sep='\t', header=None, names=cols, dtype={'index_id': str})

    footprints_contained_in_region['TFs'] = footprints_contained_in_region.motif_clusters.str.split(';').apply(lambda l: [motif_clusters_to_TFs[motif_cluster] for motif_cluster in l] if (type(l) == list) else [])
    first_dgf_index = DGF_footprint_index.get_loc(footprints_contained_in_region.identifier.iloc[0])
    last_dgf_index = DGF_footprint_index.get_loc(footprints_contained_in_region.identifier.iloc[-1])+1

    footprint_posterior = pd.read_hdf('/Users/alex/Documents/AChroMap/data/raw/Vierstra_supp/consensus_index_matrix_full_hg38.hdf', key='consensus_index_matrix_full_hg38',
                                      start=first_dgf_index, stop=last_dgf_index, columns=K562_DGF_samples)

    footprint_binary = pd.read_hdf('/Users/alex/Documents/AChroMap/data/raw/Vierstra_supp/consensus_index_matrix_binary_hg38.hdf', key='consensus_index_matrix_full_hg38',
                                   start=first_dgf_index, stop=last_dgf_index, columns=K562_DGF_samples)

    k562_dgf = footprints_contained_in_region.merge(footprint_posterior, how='left', left_on='identifier', right_index=True)
    k562_dgf = k562_dgf.merge(footprint_binary, how='left', left_on='identifier', right_index=True, suffixes=['_posterior','_binary'])

    return k562_dgf
k562_dgf = get_k562_DGF(chrom, start, stop)
k562_dgf.plot.scatter(x='K562-DS15363_posterior', y='K562-DS16924_posterior', figsize=(4,4), logx=True, logy=True)
<AxesSubplot:xlabel='K562-DS15363_posterior', ylabel='K562-DS16924_posterior'>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:01.235248 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/

Get ChIP

def get_K562_ChIP(chrom, start, end):

    k562_merged_ChIP_path = '/Users/alex/Documents/AChroMap/data/raw/ENCODE/k562_chip/K562_ChIP_merged.bed.gz'
    tabix_coords = chrom+':'+str(start)+'-'+str(end)
    tabix_command = f'tabix {k562_merged_ChIP_path} {tabix_coords}'
    peaks_contained_in_region = read_shell(shlex.split(tabix_command), sep='\t', header=None, names=['chrom','chromStart','chromEnd','name','signalValue'])

    return peaks_contained_in_region
k562_chip = get_K562_ChIP(chrom, start, stop)
k562_chip
chrom chromStart chromEnd name signalValue
0 chr11 47377411 47379111 RBFOX2 310.05267
1 chr11 47377645 47378235 NFRKB 47.98813
2 chr11 47377730 47378190 ZNF639 39.63035
3 chr11 47377781 47378485 BRD4 43.30838
4 chr11 47377803 47378483 ZNF175 16.40253
... ... ... ... ... ...
202 chr11 47378666 47379076 UBTF 8.42097
203 chr11 47378699 47379035 ERF 50.17644
204 chr11 47378709 47379045 SHOX2 42.63161
205 chr11 47378832 47379248 PRDM10 53.45506
206 chr11 47378853 47379277 CBFB 39.00836

207 rows × 5 columns

all_chip_able_targets = pd.read_csv('/Users/alex/Documents/AChroMap/data/raw/ENCODE/k562_chip/metadata.cleaned.tsv', sep='\t').Target

Scatterplots

jaspar_chipable_tfbs = jaspar_tfbs[jaspar_tfbs.TFs.isin(ChIPable_JASPARable_TFs)]
k562_chip_overlapping_jaspar = k562_chip[k562_chip.name.isin(ChIPable_JASPARable_TFs)]

max_chip = k562_chip_overlapping_jaspar.groupby('name')['signalValue'].max().reindex(ChIPable_JASPARable_TFs).fillna(0).rename('max_chip')
max_tfbs_score = jaspar_chipable_tfbs.groupby('TFs')['score'].max().reindex(ChIPable_JASPARable_TFs).fillna(0).rename('max_tfbs_score')

df = pd.concat((max_chip, max_tfbs_score), axis=1)
TFs_with_SPI1_promoter_chip_signal = set(df[df.max_chip > 0].index)
TFs_with_SPI1_promoter_tfbs = set(df[df.max_tfbs_score > 0].index)

plt.rcParams['figure.figsize'] = [12, 5]

venn3((set(ChIPable_JASPARable_TFs), TFs_with_SPI1_promoter_tfbs, TFs_with_SPI1_promoter_chip_signal),
      (f'All TFs we have PWMs and ChIP data for ({len(ChIPable_JASPARable_TFs)})',
       f'Has non-zero PWM score at locus ({len(TFs_with_SPI1_promoter_tfbs)})',
       f'Has non-zero ChIP signal at locus ({len(TFs_with_SPI1_promoter_chip_signal)})'))
<matplotlib_venn._common.VennDiagram at 0x1b3c5ced0>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:47.070592 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
 
df['c'] = np.nan

df.loc[(df.max_chip > 0) & (df.max_tfbs_score > 0), 'c'] = 'g'
df.loc[(df.max_chip > 0) & (df.max_tfbs_score == 0), 'c'] = 'm'
df.loc[(df.max_chip == 0) & (df.max_tfbs_score > 0), 'c'] = 'r'
df.loc[(df.max_chip == 0) & (df.max_tfbs_score == 0), 'c'] = 'c'
ax = df.plot.scatter(x='max_tfbs_score', y='max_chip', s=30, c='c', alpha=0.3)
# for name, point in df.iterrows():
#     ax.annotate(name, (point.max_tfbs_score, point.max_chip), fontsize=6)
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:47.310842 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
spearman_correlation = df[['max_chip','max_tfbs_score']].corr(method='spearman').loc['max_chip','max_tfbs_score']
spearman_correlation
0.287534225542192
grid = sns.jointplot(data=df, x='max_tfbs_score', y='max_chip', hue="c", ylim=(-10, 2500), xlim=(-2, 700), alpha=0.5, legend=False)
grid.fig.set_figwidth(10)
grid.fig.set_figheight(5)
/usr/local/lib/python3.7/site-packages/seaborn/distributions.py:305: UserWarning:

Dataset has 0 variance; skipping density estimate.

/usr/local/lib/python3.7/site-packages/seaborn/distributions.py:305: UserWarning:

Dataset has 0 variance; skipping density estimate.

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:47.595077 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
 
def plot_distribs(no, yes, xlabel):

    fig, (ax0, ax1) = plt.subplots(1, 2)

    ax0 = no.plot.kde(ax=ax0, xlim=(yes.min(), yes.max()), legend=True)
    yes.plot.kde(ax=ax0, legend=True)

    ax0.set_xlabel(xlabel)

    no_x = no.sort_values()
    no_cdf = no_x.rank(method='average', pct=True)

    yes_x = yes.sort_values()
    yes_cdf = yes_x.rank(method='average', pct=True)

    ax1.plot([0]+no_x.tolist(), [0]+no_cdf.tolist(), label=no.name)
    ax1.plot([0]+yes_x.tolist(), [0]+yes_cdf.tolist(), label=yes.name)
    ax1.legend()
    ax1.set_ylim(0, 1)
    ax1.set_ylabel('CDF')
    ax1.set_xlabel(xlabel)

    return ax0, ax1
no = df[df.max_tfbs_score == 0].max_chip.rename('No TFBS')
yes = df[df.max_tfbs_score > 0].max_chip.rename('TFBS')
ax0, ax1 = plot_distribs(no, yes, 'ChIP signal')
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:47.874059 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
no = df[df.max_chip == 0].max_tfbs_score.rename('No ChIP')
yes = df[df.max_chip > 0].max_tfbs_score.rename('ChIP')

ax0, ax1 = plot_distribs(no, yes, 'TFBS signal')
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:48.132447 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
def PPV_TPR_FPR(df, thresh_index, tfbs='max_tfbs_score', chip='max_chip'):

    df.sort_values(tfbs, inplace=True)
    negative_prediction = df.iloc[:thresh_index]
    positive_prediction = df.iloc[thresh_index:]

    true_negative = sum(negative_prediction[chip] == 0)
    false_negative = sum(negative_prediction[chip] > 0)

    false_positive = sum(positive_prediction[chip] == 0)
    true_positive = sum(positive_prediction[chip] > 0)

    positive_predictive_value = (true_positive / (true_positive + false_positive)) if (true_positive + false_positive) > 0 else 1 # also called precision
    true_positive_rate = (true_positive / (true_positive + false_negative)) if (true_positive + false_negative) > 0 else 1        # also called recall
    false_positive_rate = (false_positive / (false_positive + true_negative)) if (false_positive + true_negative) > 0 else 1

    return positive_predictive_value, true_positive_rate, false_positive_rate

def rolling_PPV_TPR_FPR(df, tfbs='max_tfbs_score', chip='max_chip'):

    df.sort_values(tfbs, inplace=True)
    indices_of_thresholds = np.where(~df[tfbs].duplicated())[0]

    df = pd.DataFrame([PPV_TPR_FPR(df, thresh_index, tfbs=tfbs, chip=chip) for thresh_index in indices_of_thresholds], columns=['positive_predictive_value', 'true_positive_rate', 'false_positive_rate'])[::-1]

    df['f1'] = 2 * (df.positive_predictive_value * df.true_positive_rate) / (df.positive_predictive_value + df.true_positive_rate)

    return df

def AUROC(stats):

    auroc = np.trapz(stats.true_positive_rate, stats.false_positive_rate)
    return auroc

def plot_AUROC(stats):

    auroc = AUROC(stats)

    ax = stats.plot.line(y='true_positive_rate', x='false_positive_rate', legend=False, title=f'AUROC: {round(auroc, 3)}', xlim=(0, 1), ylim=(0, 1))
    ax.plot((0,1),(0,1), c='r', linestyle='dotted')

    ax.set_aspect('equal', adjustable='box')
    ax.set_ylabel('true_positive_rate')

    return ax, auroc
ax, auroc = plot_AUROC(rolling_PPV_TPR_FPR(df))
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:48.329299 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/

3D Ridgeline Plot

SPI1_promoter_TF_ChIP_signal_clipped = SPI1_promoter_TF_ChIP_signal.clip(lower=1, upper=10).replace(1, np.nan)
signal_df = SPI1_promoter_TF_ChIP_signal.loc[:, SPI1_promoter_TF_ChIP_signal.columns.str.split('_').str[0].isin(ChIPable_JASPARable_TFs)]
TF_order = signal_df.max(axis=0).sort_values().index.str.split('_').str[0].drop_duplicates(keep='first')
column_order = pd.Series({col: (TF_order.get_loc(col.split('_')[0]), int(col.split('_')[1])) for col in signal_df.columns}).sort_values().index
signal_df = signal_df[column_order]
signal_df = signal_df.iloc[:, -20:]
columns_to_TF = signal_df.columns.to_frame()[0].str.split('_').str[0].rename('TF').rename_axis('track').reset_index().reset_index().set_index('track')
TF_to_columns = columns_to_TF.reset_index().set_index('TF')
plotted_TFs = set(TF_to_columns.index)
genome_linspace = np.transpose(np.tile(signal_df.index.values, (len(signal_df.T), 1)))
print('genome_linspace:', genome_linspace.shape)
signal_heights = signal_df.values
print('signal_heights:\t', signal_heights.shape)
TF_y_index = np.linspace(0,len(signal_df.columns)-1,len(signal_df.columns))
print('TF_y_index: \t', TF_y_index.shape)
genome_linspace: (800, 20)
signal_heights:	 (800, 20)
TF_y_index: 	 (20,)
signal_verts = []
for y_i in range(len(TF_y_index)):
    # zero at the beginning and the end to get a nice flat bottom on the polygons
    xs = np.concatenate([[genome_linspace[0,y_i]], genome_linspace[:,y_i], [genome_linspace[-1,y_i]]])
    ys = np.concatenate([[0],signal_heights[:,y_i],[0]])
    signal_verts.append(list(zip(xs, ys)))
from matplotlib.colors import LinearSegmentedColormap
colors = LinearSegmentedColormap('colormap', cm.jet._segmentdata.copy(), 20)
colors = [colors(i) for i in range(20)]
peak_verts = []
peak_offsets = []

for col, d in columns_to_TF.iterrows():
    for i, peak in k562_chip[k562_chip['name'] == d.TF].iterrows():
        xs = [peak.chromStart, peak.chromStart, peak.chromEnd, peak.chromEnd]
        ys = [0, peak.signalValue/20, peak.signalValue/20, 0]
        peak_verts.append(list(zip(xs, ys)))
        peak_offsets.append(d['index'])
tfbs_verts = []
tfbs_offsets = []

for col, d in columns_to_TF.iterrows():
    for i, tfbs in jaspar_tfbs[jaspar_tfbs['TFs'] == d.TF].iterrows():
        xs = [tfbs.chromStart, tfbs.chromStart, tfbs.chromEnd, tfbs.chromEnd]
        ys = [0, tfbs.score/50, tfbs.score/50, 0]
        tfbs_verts.append(list(zip(xs, ys)))
        tfbs_offsets.append(d['index'])
from mpl_toolkits.mplot3d import Axes3D

plt.rcParams['figure.figsize'] = [10, len(signal_df.columns)]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

signal_poly = PolyCollection(signal_verts, facecolors=colors)
signal_poly.set_alpha(0.7)
ax.add_collection3d(signal_poly, zs=TF_y_index, zdir='y')

peak_poly = PolyCollection(peak_verts, edgecolors=(0,0,0,0.3), facecolors=(0,0,0,0))
# peak_poly.set_alpha(0.4)
ax.add_collection3d(peak_poly, zs=peak_offsets, zdir='y')

tfbs_poly = PolyCollection(tfbs_verts)
tfbs_poly.set_alpha(1)
ax.add_collection3d(tfbs_poly, zs=tfbs_offsets, zdir='y')




ax.set_xlim3d(genome_linspace.min(), genome_linspace.max())
ax.set_xlabel(chrom)
ax.set_ylim3d(TF_y_index.min(), TF_y_index.max())
ax.set_ylabel('TF')
ax.set_zlim3d(signal_heights.min(), signal_heights.max())
ax.set_zlabel('signal fold change over background')

xtick_vals = ax.get_xticks()
ax.set_xticks(xtick_vals)
ax.set_xticklabels(['{:.0f}'.format(x) for x in xtick_vals])

ax.set_yticks(TF_y_index)
ax.set_yticklabels(signal_df.columns)


ax.xaxis.pane.fill = False
ax.yaxis.pane.fill = False
ax.zaxis.pane.fill = False
ax.xaxis.pane.set_edgecolor('w')
ax.yaxis.pane.set_edgecolor('w')
ax.zaxis.pane.set_edgecolor('w')
ax.grid(False)


ax.get_proj = lambda: np.dot(Axes3D.get_proj(ax), np.diag([1.0, 1.5, 1.0, 1.0]))


plt.show()
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:48.591149 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/

Heatmap

plt.rcParams['figure.figsize'] = [12, len(signal_df.columns)]

axs = signal_df.plot.area(subplots=True, ylim=(1, 20), xlim=(start, stop))

axs_dd = defaultdict(list)
for uid, ax in zip(signal_df.columns, axs):
    axs_dd[uid.split('_')[0]].append(ax)

for index, peak in k562_chip_overlapping_jaspar.iterrows():
    if peak['name'] in axs_dd.keys():
        for ax in axs_dd[peak['name']]:
            ax.add_patch(Rectangle((peak.chromStart, 1), peak.chromEnd-peak.chromStart, peak.signalValue/10, fc=(1,0,0,0.5), ec=(0,0,0,1), zorder=-1))

for index, tfbs in jaspar_tfbs.iterrows():
    if tfbs['TFs'] in axs_dd.keys():
        for ax in axs_dd[tfbs['TFs']]:
            ax.add_patch(Rectangle((tfbs.chromStart, 1), tfbs.chromEnd-tfbs.chromStart, tfbs.score/50, fc=(0,1,0,0.5), ec=(0,0,0,1), zorder=10))


current_axis = plt.gca()
current_values = current_axis.get_xticks()
current_axis.set_xticks(current_values)
current_axis.set_xticklabels(['{:.0f}'.format(x) for x in current_values])
current_axis.set_xlim(start, stop)
current_axis.set_xlabel(chrom)

None
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:50.021194 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
# axs = SPI1_promoter_TF_ChIP_signal.plot.area(subplots=True, ylim=(1, 10), xlim=(start, stop))

# current_axis = plt.gca()
# current_values = current_axis.get_xticks()
# current_axis.set_xticks(current_values)
# current_axis.set_xticklabels(['{:.0f}'.format(x) for x in current_values])
# current_axis.set_xlim(start, stop)
# current_axis.set_xlabel(chrom)

# plt.savefig('/Users/alex/Desktop/SPI1_promoter_K562_TF_ChIP.svg')

Consider RNA too

# jaspar_chipable_tfbs
df2 = df.merge(K562_TF_nanomolar_conc.rename('predicted_TF_conc'), how='left', left_index=True, right_index=True)
df2['tfbs_x_tf'] = df2.max_tfbs_score * df2.predicted_TF_conc
df2.tfbs_x_tf
PKNOX1          0.000000
ZNF140          0.000000
USF2            0.000000
BACH1           0.000000
IRF1            0.000000
               ...      
JUND        55152.203936
MAZ        237096.533761
CREM        27088.520878
CREB1       44948.989942
ZKSCAN3     33654.491456
Name: tfbs_x_tf, Length: 172, dtype: float64
spearman_correlation = df2[['max_chip','tfbs_x_tf']].corr(method='spearman').loc['max_chip','tfbs_x_tf']
spearman_correlation
0.3061164936707757
grid = sns.jointplot(data=df2, x='tfbs_x_tf', y='max_chip', ylim=(-1e1, 1e3), xlim=(-1e2, 1e5), hue="c", alpha=0.5, legend=False)
grid.fig.set_figwidth(10)
grid.fig.set_figheight(5)
/usr/local/lib/python3.7/site-packages/seaborn/distributions.py:305: UserWarning:

Dataset has 0 variance; skipping density estimate.

/usr/local/lib/python3.7/site-packages/seaborn/distributions.py:305: UserWarning:

Dataset has 0 variance; skipping density estimate.

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:29:50.585826 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/

All ATAC regions

import pickle5 as pickle

def _str(chrom, start, end): return f'{chrom}:{start}-{end}'

def intervals_from_bedlike(intervals_filepath, chrom=None, sep='\t'):

    intervals = pd.read_csv(intervals_filepath, header=None, sep=sep).iloc[:, :3]
    if chrom != None: intervals = intervals[intervals.iloc[:, 0] == chrom]

    return list(intervals.itertuples(index=False, name=None))

import warnings
from tables import NaturalNameWarning
warnings.filterwarnings('ignore', category=NaturalNameWarning)
cols = ['start','end','motif','score','strand','TF']

for i in ['1', '2']:

    print('reading', i)
    intervals = intervals_from_bedlike(f'/Users/alex/Desktop/k562_jaspar_quick/K562_ATAC_peaks_{i}.csv', sep=',')
    with open(f'/Users/alex/Desktop/k562_jaspar_quick/K562_ATAC_peaks_{i}_tfbs.pickle', 'rb') as f:
        K562_ATAC_peaks_tfbs = pickle.load(f)

    print('writing', i)
    for ival, tfbs in zip(intervals, K562_ATAC_peaks_tfbs):
        df = pd.DataFrame([[row[0], row[1]]+row[2].split('\t') for row in tfbs], columns=cols)
        try: df.to_hdf(f'/Users/alex/Desktop/k562_jaspar_quick/K562_ATAC_peaks_{i}_tfbs.hdf', f'{_str(*ival)}/jaspar2022_TFBS')
        except Exception as e:
            print(e)
1
/usr/local/lib/python3.7/site-packages/tables/group.py:499: PerformanceWarning: group ``/`` is exceeding the recommended maximum number of children (16384); be ready to see PyTables asking for *lots* of memory and possibly slow I/O.
  PerformanceWarning)

Generate reports for each peak

plt.rcParams['figure.figsize'] = [12, 5]

peak_index = 9
df = K562_ATAC_2_tfbs_and_chip[peak_index]
df = pd.concat((K562_TF_nanomolar_conc[K562_TF_nanomolar_conc > 0].rename('TF_conc').to_frame(), df), axis=1)
df = df[df.index.isin(ChIPable_CISBPable_TFs)]
df = df.fillna(0)
df = df[df.pwm_score >= 0] # confusing, look into this later
# K562_ATAC_peaks_2_filtered
def set_colors(df):

    df['c'] = np.nan

    df.loc[(df.chip_signalValue > 0) & (df.pwm_score > 0), 'c'] = 'g'
    df.loc[(df.chip_signalValue > 0) & (df.pwm_score == 0), 'c'] = 'm'
    df.loc[(df.chip_signalValue == 0) & (df.pwm_score > 0), 'c'] = 'r'
    df.loc[(df.chip_signalValue == 0) & (df.pwm_score == 0), 'c'] = 'c'
    return df
df = set_colors(df)
df['chip_signalBinary'] = (df.chip_signalValue > 0).astype(int)
df['pwm_scoreBinary'] = (df.pwm_score > 0).astype(int)
def atac_peak_distrib_plot(peak_index, peaks_df=K562_ATAC_peaks_2_filtered):

    this_peak = peaks_df.loc[peak_index]

    g = sns.JointGrid(data=peaks_df, x='signalValue', y='score', xlim=(0,45), ylim=(0, 1050))
    g.plot_joint(sns.histplot)
    g.plot_marginals(sns.kdeplot)

    g.fig.suptitle(f'Peak signal percentile rank: {round(this_peak.signalValue_percentile*100, 2)}%')

    g.fig.set_figwidth(10)
    g.fig.set_figheight(5)

    g.ax_joint.plot(this_peak.signalValue, this_peak.score, markersize=12, c='red', marker="+", label=f'{peak_index}: '+this_peak['name'])
    g.ax_joint.plot(this_peak.signalValue, this_peak.score, markersize=10, c='red', marker=".")

    g.ax_marg_x.axvline(this_peak.signalValue, c='red')
    g.ax_marg_y.axhline(this_peak.score, c='red')

    g.ax_joint.legend(loc='lower right')

    return g
atac_peak_distrib_plot(peak_index)
<seaborn.axisgrid.JointGrid at 0x1b4ffe350>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:30:04.337129 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
def lims(series, buffer=0.01):

    spread = series.max() - series.min()
    limits = (series.min() - spread*buffer,  series.max() + spread*buffer)
    return limits

def log_lims(series, log_buffer=1):

    lower = np.log10(series[series > 0].min()) - log_buffer
    upper = np.log10(series[series > 0].max()) + log_buffer
    return (lower, upper)
def spearman_corr(df, tfbs, chip): return df.loc[:, [tfbs, chip]].corr(method='spearman').loc[tfbs, chip]

spearman_corr(df, 'pwm_score', 'chip_signalValue'), spearman_corr(df, 'pwm_score_x_TF_conc', 'chip_signalValue')
(0.18737706128918927, 0.19541964595138492)
def plot_peak_scatter(df, mode='kde', pred_col='pwm_score'):

    palette={'g':'g','m':'m','r':'r','c':'c'}

    g = sns.JointGrid(data=df, x=pred_col, y='chip_signalValue', xlim=lims(df[pred_col]), ylim=(-1, 5e4), hue='c', palette=palette)
    g.plot_joint(sns.scatterplot, alpha=0.5)

    if mode == 'hist':
        sns.histplot(data=df, x=pred_col, bins=100, ax=g.ax_marg_x, hue='chip_signalBinary', palette={0:'r', 1:'g'}, legend=False, log_scale=(False, True),element="step")
        sns.histplot(data=df, y='chip_signalValue', bins=[0, 1]+np.logspace(1, 5).tolist(), ax=g.ax_marg_y, hue='pwm_scoreBinary', palette={0:'m', 1:'g'}, legend=False, log_scale=(True, False),element="step")

        g.ax_marg_y.tick_params(labeltop=True)
        g.ax_marg_y.grid(True, axis='x', ls=':')

        g.ax_marg_x.tick_params(labelleft=True)
        g.ax_marg_x.grid(True, axis='y', ls=':')

    elif mode=='kde':
        sns.kdeplot(data=df[df[pred_col] > 0], x=pred_col, ax=g.ax_marg_x, hue='chip_signalBinary', palette={0:'r', 1:'g'}, fill=True, alpha=0.1, legend=False)
        sns.kdeplot(data=df[df.chip_signalValue > 0], y='chip_signalValue', ax=g.ax_marg_y, hue='pwm_scoreBinary', palette={0:'m', 1:'g'}, fill=True, alpha=0.1, legend=False, log_scale=(False, True))

    g.fig.set_figwidth(10)
    g.fig.set_figheight(5)

    corr = spearman_corr(df, pred_col, 'chip_signalValue')
    g.fig.suptitle(f'Spearman_correlation: {round(corr, 4)}')

    handles, labels = g.ax_joint.get_legend_handles_labels()
    replacement = {'g':'Motif match and ChIP peak',
                   'm':'ChIP peak without motif match',
                   'r':'Motif match without ChIP peak',
                   'c':'Neither motif match nor ChIP Peak'}
    labels = [replacement[x] for x in labels]
    legend = g.ax_joint.legend(handles, labels, loc='upper left')

    g.ax_joint.set_yscale('symlog', linthresh=min(10, df[df.chip_signalValue > 0].chip_signalValue.min()))

    return g, corr
g, corr1 = plot_peak_scatter(df, pred_col='pwm_score')
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:30:05.148247 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
g, corr2 = plot_peak_scatter(df, pred_col='pwm_score_x_TF_conc')
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T10:30:05.561010 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
def hypergeom_pval(df, universe=ChIPable_CISBPable_TFs, tfbs='pwm_score', chip='chip_signalValue'):

    TFs_with_chip_signal = set(df[df[chip] > 0].index)
    TFs_with_tfbs = set(df[df[tfbs] > 0].index)

    # the total number of marbles in the jar
    M = len(ChIPable_CISBPable_TFs)

    # the number of red marbles in the jar
    n = len(TFs_with_chip_signal)

    # the number of draws from the jar
    N = len(TFs_with_tfbs)

    # the number of red marbles drawn
    x = len(TFs_with_chip_signal & TFs_with_tfbs)

    pval = hypergeom.sf(x-1, M, n, N)

    return pval


def pwm_chip_overlap_venn(df):

    fig, ax = plt.subplots()

    TFs_with_chip_signal = set(df[df.chip_signalValue > 0].index)
    TFs_with_tfbs = set(df[df.pwm_score > 0].index)

    pval = hypergeom_pval(df)

    venn3((set(ChIPable_CISBPable_TFs), TFs_with_tfbs, TFs_with_chip_signal),
          (f'All TFs we have PWMs and ChIP data for ({len(ChIPable_JASPARable_TFs)})',
           f'Has non-zero PWM score at locus ({len(TFs_with_tfbs)})',
           f'Has non-zero ChIP signal at locus ({len(TFs_with_chip_signal)})'))

    fig.suptitle(f'Hypergeometric -log10(pval): {round(-np.log10(pval),2)}')

    return fig, pval
ax, pval = pwm_chip_overlap_venn(df)
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T13:43:34.812882 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
stats = rolling_PPV_TPR_FPR(df, tfbs='pwm_score', chip='chip_signalValue').iloc[1:]
def plot_PR(stats):

    stats = stats.rename(columns={'positive_predictive_value':'precision', 'true_positive_rate': 'recall'})
    ax = stats.plot.line(y='precision', x='recall', legend=False, xlim=(0, 1), ylim=(0, 1))
    ax.set_aspect('equal', adjustable='box')
    ax.set_ylabel('precision')
    ax.set_xlabel('recall')

    f_scores = np.linspace(0.2, 0.8, num=4)
    lines, labels = [], []
    for f_score in f_scores:
        x = np.linspace(0.01, 1)
        y = f_score * x / (2 * x - f_score)
        (l,) = plt.plot(x[y >= 0], y[y >= 0], color="gray", alpha=0.2)
        plt.annotate("f1={0:0.1f}".format(f_score), xy=(0.9, y[45] + 0.02))

    top_f1 = stats.f1.max()
    ax.set_title(f'Best F1 score: {round(top_f1, 2)}')

    return ax, top_f1
ax, top_f1 = plot_PR(stats)
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T13:43:36.101047 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
def clean_peak_df(peak_df):

    peak_df = pd.concat((K562_TF_nanomolar_conc.reindex(ChIPable_CISBPable_TFs).rename('TF_conc').to_frame(), peak_df), axis=1).fillna(0)
    peak_df = peak_df[peak_df.pwm_score >= 0] # confusing, look into this later
    peak_df = set_colors(peak_df)
    peak_df['chip_signalBinary'] = (peak_df.chip_signalValue > 0).astype(int)
    peak_df['pwm_scoreBinary'] = (peak_df.pwm_score > 0).astype(int)
    return peak_df
def peak_summary(peak_index):

    peak_df = clean_peak_df(K562_ATAC_2_tfbs_and_chip[peak_index])

    g1 = atac_peak_distrib_plot(peak_index)

    g2 = plot_peak_scatter(peak_df)

    ax, hypergeom_pval = pwm_chip_overlap_venn(peak_df)

    stats = rolling_PPV_TPR_FPR(peak_df, tfbs='pwm_score', chip='chip_signalValue').iloc[1:]
    ax, top_f1 = plot_PR(stats)
    ax, auroc = plot_AUROC(stats)

    return peak_df
peak_df = peak_summary(9)
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T13:43:37.426045 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T13:43:38.020791 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T13:43:38.148149 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T13:43:38.217033 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T13:43:38.290944 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
def detailed_peak_summary():
    pass
 

Evaluating across all ATAC peaks

def peak_stats(peak_index):

    atac_signalValue_percentile = K562_ATAC_peaks_2_filtered.loc[peak_index].at['signalValue_percentile']

    peak_df = clean_peak_df(K562_ATAC_2_tfbs_and_chip[peak_index])

    num_TFs_overall = len(peak_df)
    num_chip_peaks = peak_df.chip_signalBinary.sum()
    num_tfbs_hits = peak_df.pwm_scoreBinary.sum()
    num_phantom_tfs = len(peak_df[(peak_df.TF_conc == 0) & (peak_df.chip_signalValue > 0)])

    corr_without_rna = spearman_corr(peak_df, 'pwm_score', 'chip_signalValue')
    corr_with_rna = spearman_corr(peak_df, 'pwm_score_x_TF_conc', 'chip_signalValue')

    pval = hypergeom_pval(peak_df)

    stats = rolling_PPV_TPR_FPR(peak_df, tfbs='pwm_score', chip='chip_signalValue').iloc[1:]
    top_f1 = stats.f1.max()
    auroc = AUROC(stats)

    return (atac_signalValue_percentile,
            num_TFs_overall,
            num_chip_peaks,
            num_tfbs_hits,
            num_phantom_tfs,
            corr_without_rna,
            corr_with_rna,
            pval,
            top_f1,
            auroc)
#         'num_TFs_overall',
#         'num_chip_peaks',
#         'num_tfbs_hits',
#         'num_phantom_tfs',
#         'corr_without_rna',
#         'corr_with_rna',
#         'pval',
#         'top_f1',
#         'auroc')

# metrics_df = {i: peak_stats(i) for i in K562_ATAC_2_tfbs_and_chip.keys()}
# metrics_df = pd.DataFrame(metrics_df, index=cols).T
# metrics_df.to_csv('../data/equilibrium_TFs/K562_ATAC_PWM_vs_ChIP_baseline_metrics.csv')
metrics_df = pd.read_csv('../data/equilibrium_TFs/K562_ATAC_PWM_vs_ChIP_baseline_metrics.csv', index_col=0)
metrics_df = pd.concat((K562_ATAC_peaks_2_filtered.signalValue, metrics_df), axis=1).rename(columns={'signalValue':'atac_signalValue'})
set_matplotlib_formats('png')
metrics_df.plot.scatter(x='atac_signalValue', y='num_chip_peaks', alpha=0.3, s=0.1)
<AxesSubplot:xlabel='atac_signalValue', ylabel='num_chip_peaks'>
metrics_df.plot.scatter(x='num_tfbs_hits', y='num_chip_peaks', alpha=0.3, s=0.1)
<AxesSubplot:xlabel='num_tfbs_hits', ylabel='num_chip_peaks'>
metrics_df
atac_signalValue atac_signalValue_percentile num_TFs_overall num_chip_peaks num_tfbs_hits num_phantom_tfs corr_without_rna corr_with_rna pval top_f1 auroc
0 4.29354 0.385592 223.0 0.0 89.0 0.0 NaN NaN 1.000000 0.000000 0.991031
1 4.60022 0.428760 223.0 0.0 89.0 0.0 NaN NaN 1.000000 0.000000 0.991031
3 2.76013 0.093013 223.0 0.0 89.0 0.0 NaN NaN 1.000000 0.000000 0.991031
6 6.59365 0.618386 223.0 0.0 89.0 0.0 NaN NaN 1.000000 0.000000 0.991031
7 3.68018 0.279991 223.0 0.0 89.0 0.0 NaN NaN 1.000000 0.000000 0.991031
... ... ... ... ... ... ... ... ... ... ... ...
203867 13.03226 0.877598 223.0 1.0 89.0 0.0 0.065980 0.088366 0.395556 0.035088 0.752252
203869 3.22015 0.186231 223.0 3.0 89.0 0.0 -0.033323 -0.003527 0.781102 0.027027 0.427273
203871 5.52026 0.531256 223.0 0.0 89.0 0.0 NaN NaN 1.000000 0.000000 0.991031
203872 3.22015 0.186231 223.0 1.0 89.0 0.0 -0.052430 -0.052430 1.000000 0.008929 0.299550
203873 3.98686 0.335915 223.0 0.0 89.0 0.0 NaN NaN 1.000000 0.000000 0.991031

107082 rows × 11 columns

peak_df = peak_summary(0)
/usr/local/lib/python3.7/site-packages/matplotlib_venn/_venn3.py:61: UserWarning:

Circle C has zero area

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T14:00:19.277048 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T14:00:19.857992 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T14:00:19.977428 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T14:00:20.047179 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-04-07T14:00:20.122064 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
 

per TF

# def func(i, row):
#     return pwm_vs_chip(row.at['chrom'], row.at['chromStart'], row.at['chromEnd'])

# imap_unordered(func, K562_ATAC_peaks_2_filtered.iterrows(), chunksize)
K562_ATAC_2_tfbs_and_chip
K562_TF_nanomolar_conc
tfdb2[~tfdb2.Name.isin(K562_TF_nanomolar_conc.index)]
ccre
meuleman_dhs_index
chrom chromStart chromEnd name index
0 chr1 16140 16200 1.10011 0
1 chr1 51868 52040 1.10021 1
2 chr1 66370 66482 1.10027 2
3 chr1 79100 79231 1.1003 3
4 chr1 79430 79497 1.10032 4
... ... ... ... ... ...
2806287 chrY 26670795 26670971 Y.51915 2806287
2806288 chrY 26670849 26671280 Y.51916 2806288
2806289 chrY 26671100 26671485 Y.51918 2806289
2806290 chrY 26671584 26671699 Y.51919 2806290
2806291 chrY 56679795 56679899 Y.99105 2806291

2806292 rows × 5 columns

pd.read_csv('../../AChroMap/data/raw/ENCODE/cCRE/GRCh38_closest_PLS_cCRE.csv')
transcript_id promoter distance
0 ENST00000263100.7 EH38E1966268 10.0
1 ENST00000596924.1 EH38E1966264 64.0
2 ENST00000598345.1 EH38E1966264 41.0
3 ENST00000595014.1 EH38E1966268 18.0
4 ENST00000600966.1 EH38E1966268 380.0
... ... ... ...
206652 ENST00000572699.1 NaN NaN
206653 ENST00000570365.1 NaN NaN
206654 ENST00000572831.1 EH38E1841931 -40573.5
206655 ENST00000575428.1 EH38E1841931 -43081.5
206656 ENST00000609567.1 EH38E1841202 -1164.0

206657 rows × 3 columns

Later

#  need to just find which regions are open -- sort of like a binary thing on the cCREs. Maybe they have such a table already
#  need to make multi-wigs including TFBS scores, TFBS scores * expression, and full model.
#  I need to scan the entire reference genome with the full set of motif models -- need to do that on a cluster.
# Resgen: chip, dgf, and atac
# chip: bigwig, bed
# dgf: bed
# atac: bigwig, bed

Plot

strand = reference_transcript.strand.iat[0]
tss, strand
(47378576, -1)
arrow_direction = 8 if strand == -1 else 9
k562_dgf
contig start stop identifier mean_signal num_samples num_fps width summit_pos core_start core_end motif_clusters TFs K562-DS15363_posterior K562-DS16924_posterior K562-DS15363_binary K562-DS16924_binary
0 chr11 47378353 47378364 11.4156904.4 7.638194 1 1 11 47378358 47378358 47378358 GRHL [[GRHL1, GRHL2, TFCP2]] 0.186745 0.481944 0 0
1 chr11 47378391 47378399 11.4156904.7 11.159245 2 2 8 47378394 47378394 47378395 GC-tract;GLIS;ZNF563 [[KLF15, MAZ, THAP1, ZNF263, ZNF341, ZNF467, Z... 1.777505 0.212817 0 0
2 chr11 47378444 47378464 11.4156909.1 99.355537 13 19 20 47378454 47378432 47378462 E2F/2;Ebox/CACCTG;Ebox/CAGCTG;KLF/SP/1;KLF/SP/... [[E2F1, E2F3, E2F4, E2F6, E2F7, TFDP1], [FIGLA... 3.286462 3.628242 0 0
3 chr11 47378497 47378556 11.4156909.2 92.230072 12 17 59 47378539 47378498 47378552 CTCF;ETS/2;KLF/SP/1;KLF/SP/2;MAF;NFKB/1;NFKB/2... [[CTCF, CTCFL], [EHF, ELF2, ELF3, ELF5, ELK4, ... 6.373700 4.206899 1 0
4 chr11 47378590 47378607 11.4156909.3 205.326386 25 26 17 47378598 47378590 47378603 GRHL;SREBF1 [[GRHL1, GRHL2, TFCP2], [SREBF1, SREBF2]] 7.060753 9.039158 1 1
5 chr11 47378641 47378657 11.4156909.4 325.041038 30 31 16 47378649 47378644 47378653 MBD2;NR/19;TFAP2/1 [[MBD2], [NR1D1, NR1D2, RORA, RORB, RORC], [TF... 10.623709 16.346320 1 1
6 chr11 47378700 47378713 11.4156909.5 94.406494 15 18 13 47378705 47378677 47378718 CTCF;E2F/2;Ebox/CACCTG;GLIS;INSM1;NFI/1;NR/3;N... [[CTCF, CTCFL], [E2F1, E2F3, E2F4, E2F6, E2F7,... 4.565677 6.131198 0 1
7 chr11 47378755 47378766 11.4156909.6 10.278967 2 2 11 47378761 47378760 47378762 E2F/2;ETS/2;EWSR1/FLI1;GC-tract;IRF/2;KLF/SP/2... [[E2F1, E2F3, E2F4, E2F6, E2F7, TFDP1], [EHF, ... 0.321546 0.526205 0 0
8 chr11 47378828 47378836 11.4156909.7 6.116223 1 1 8 47378832 47378832 47378832 NaN [] 0.321470 0.184080 0 0
9 chr11 47378876 47378895 11.4156915.1 11.757207 2 2 19 47378885 47378880 47378890 E2F/2;ETS/1;ETS/2;GC-tract;KLF/SP/2 [[E2F1, E2F3, E2F4, E2F6, E2F7, TFDP1], [EHF, ... 0.837988 0.150539 0 0
# opacity is the posterior
# names are the TF classes that could be bound
# I should get the binary file too, rather than only the full posterior
footprint_binary_avg
----------------------------------------------------------
NameError                Traceback (most recent call last)
<ipython-input-505-43571d82ec46> in <module>
----> 1 footprint_binary_avg

NameError: name 'footprint_binary_avg' is not defined
fig, axs = plt.subplots(nrows=2, ncols=1, sharex=True)
dgf_ax = axs[0]
chip_ax = axs[1]

samples = ['K562-DS15363','K562-DS16924']

chip_ax.set(xlabel='Genome', xlim=[start, stop])


for i, dgf in k562_dgf.iterrows():

    footprint_score_normalized = [np.clip(np.log10(dgf[sample+'_posterior'])/2, 0,1) for sample in samples]
    footprint_binary_avg = np.mean([dgf[sample+'_binary'] for sample in samples])

    dgf_ax.add_patch(Rectangle((dgf.start, 0), dgf.stop-dgf.start, 0.8, alpha=max(*footprint_score_normalized), ec=cm.Greys(footprint_binary_avg)))

    # tss
    dgf_ax.plot(tss, 0.5, lw=0, markersize=10, c='red', marker=arrow_direction)
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-02-28T16:29:22.251962 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
# some TFs will be bound
# from those with ChIP or DGF
# can I tell which is actually bound, and which isn't?
# then I want to make a similar plot with darkness being tfbs strength * expression
# then correlate TFBS strength * expression with ChIP intensity overlapping that TFBS
# can make a scatterplot: ChIP on the X or something, and TFBS strength * expression on the Y -- or reverse
# subset ChIP peaks to tfs in TFDB?
k562_chip
chrom chromStart chromEnd name signalValue
0 chr11 47377411 47379111 RBFOX2 310.05267
1 chr11 47377645 47378235 NFRKB 47.98813
2 chr11 47377730 47378190 ZNF639 39.63035
3 chr11 47377781 47378485 BRD4 43.30838
4 chr11 47377803 47378483 ZNF175 16.40253
... ... ... ... ... ...
202 chr11 47378666 47379076 UBTF 8.42097
203 chr11 47378699 47379035 ERF 50.17644
204 chr11 47378709 47379045 SHOX2 42.63161
205 chr11 47378832 47379248 PRDM10 53.45506
206 chr11 47378853 47379277 CBFB 39.00836

207 rows × 5 columns

k562_dgf
contig start stop identifier mean_signal num_samples num_fps width summit_pos core_start core_end motif_clusters TFs K562-DS15363 K562-DS16924
0 chr11 47378353 47378364 11.4156904.4 7.638194 1 1 11 47378358 47378358 47378358 GRHL [[GRHL1, GRHL2, TFCP2]] 0.186745 0.481944
1 chr11 47378391 47378399 11.4156904.7 11.159245 2 2 8 47378394 47378394 47378395 GC-tract;GLIS;ZNF563 [[KLF15, MAZ, THAP1, ZNF263, ZNF341, ZNF467, Z... 1.777505 0.212817
2 chr11 47378444 47378464 11.4156909.1 99.355537 13 19 20 47378454 47378432 47378462 E2F/2;Ebox/CACCTG;Ebox/CAGCTG;KLF/SP/1;KLF/SP/... [[E2F1, E2F3, E2F4, E2F6, E2F7, TFDP1], [FIGLA... 3.286462 3.628242
3 chr11 47378497 47378556 11.4156909.2 92.230072 12 17 59 47378539 47378498 47378552 CTCF;ETS/2;KLF/SP/1;KLF/SP/2;MAF;NFKB/1;NFKB/2... [[CTCF, CTCFL], [EHF, ELF2, ELF3, ELF5, ELK4, ... 6.373700 4.206899
4 chr11 47378590 47378607 11.4156909.3 205.326386 25 26 17 47378598 47378590 47378603 GRHL;SREBF1 [[GRHL1, GRHL2, TFCP2], [SREBF1, SREBF2]] 7.060753 9.039158
5 chr11 47378641 47378657 11.4156909.4 325.041038 30 31 16 47378649 47378644 47378653 MBD2;NR/19;TFAP2/1 [[MBD2], [NR1D1, NR1D2, RORA, RORB, RORC], [TF... 10.623709 16.346320
6 chr11 47378700 47378713 11.4156909.5 94.406494 15 18 13 47378705 47378677 47378718 CTCF;E2F/2;Ebox/CACCTG;GLIS;INSM1;NFI/1;NR/3;N... [[CTCF, CTCFL], [E2F1, E2F3, E2F4, E2F6, E2F7,... 4.565677 6.131198
7 chr11 47378755 47378766 11.4156909.6 10.278967 2 2 11 47378761 47378760 47378762 E2F/2;ETS/2;EWSR1/FLI1;GC-tract;IRF/2;KLF/SP/2... [[E2F1, E2F3, E2F4, E2F6, E2F7, TFDP1], [EHF, ... 0.321546 0.526205
8 chr11 47378828 47378836 11.4156909.7 6.116223 1 1 8 47378832 47378832 47378832 NaN [] 0.321470 0.184080
9 chr11 47378876 47378895 11.4156915.1 11.757207 2 2 19 47378885 47378880 47378890 E2F/2;ETS/1;ETS/2;GC-tract;KLF/SP/2 [[E2F1, E2F3, E2F4, E2F6, E2F7, TFDP1], [EHF, ... 0.837988 0.150539
tfdb
family log_pwm pwm e4 e5 e6 e7 e8 moving_min energy_pwm abs_pwm
TFAP2A AP-2 [[-0.5324, 0.34450000000000003, -1.5585, 0.530... [[0.1692883895, 0.295380774, 0.043196005, 0.49... 8.7416 10.7077 11.6782 12.0994 12.0994 [0, 0, 0, 0, -6.3694, -1.4235, 1.9815, 4.1161,... [[-0.6340118223862905, -0.3061541758545828, -1... [[-0.039394774342643046, 0.28846287218906463, ...
TFAP2B AP-2 [[-2.4892, 0.5962000000000001, 1.0115, -2.6127... [[0.023064020100000002, 0.3802068773, 0.576460... 8.4778 9.0331 9.0848 9.0848 9.0848 [0, 0, -5.4818, 0.465, 3.0712, 4.4056, 5.7584,... [[-1.7703176580930648, -0.2508217647175856, 0.... [[-0.6197024527548884, 0.8997934406205907, 1.1...
TFAP2C AP-2 [[-0.5975, -0.163, -1.1323, 0.7243], [-1.3232,... [[0.1585533082, 0.1774082962, 0.0666780939, 0.... 8.5386 10.4668 11.7665 12.5488 12.5707 [0, 0, 0, 0, 0, -5.1814, -0.7559, 1.8075999999... [[-0.7891465987597153, -0.72386760824293, -1.2... [[-0.24234438158701355, -0.17706539107022823, ...
TFAP2D AP-2 [[0.8455, -1.4861, 0.1063, -1.8089], [-5.6699,... [[0.6744186047, 0.046511627900000005, 0.232558... 5.9343 11.2463 13.9767 15.7208 16.6712 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.352... [[0.0, -1.535435522302642, -0.6386014303304112... [[0.4645947195740855, -1.0708408027285565, -0....
TFAP2E AP-2 [[-5.6699, 0.43660000000000004, 1.1706, -5.669... [[0.0, 0.3239601366, 0.6760398634, 0.0], [0.0,... 9.4399 10.3061 10.4818 10.4818 10.4818 [0, 0, -9.7782, -2.734, -0.7317, 4.58820000000... [[-2.6030628737168895, -0.4431944813386707, 0.... [[-1.4392600512677005, 0.7206083411105183, 1.1...
... ... ... ... ... ... ... ... ... ... ... ...
BATF3 bZIP [[-0.8147000000000001, 0.11900000000000001, 0.... [[0.1274089936, 0.2355460385, 0.5074946467, 0.... 6.8146 9.8562 12.2022 13.6464 13.7980 [0, 0, 0, 0, 0, -8.314, -4.9175, -2.5171, 0.05... [[-0.8163368838994554, -0.45895476909304567, 0... [[-0.5018744788450056, -0.14449236403859583, 0...
CREB1 bZIP [[-0.049100000000000005, -0.20500000000000002,... [[0.2750929368, 0.17007434940000002, 0.4344795... 7.4130 9.0542 9.9885 9.9885 9.9885 [0, 0, -1.1216, 0.4701, 1.9855, 3.4997, 4.8477... [[-0.27339056195788436, -0.556234955164997, 0.... [[0.15914494435293164, -0.123699448854181, 0.4...
TP53 p53 [[0.1729, -0.6791, 0.6519, -0.661], [0.7944, -... [[0.3437448898, 0.10548928590000001, 0.4020237... 6.7839 9.9738 12.5195 14.8391 16.6793 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.8618... [[-0.09388527227733984, -0.7830107128893676, 0... [[0.13210745186617728, -0.5570179887458505, 0....
TP63 p53 [[-0.5391, -1.3228, 1.0218, -0.394300000000000... [[0.1681465039, 0.0549389567, 0.58240843510000... 0.7519 5.9049 10.4201 14.4116 19.4253 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.390... [[-0.7397244747417906, -1.360984020084028, 0.0... [[-0.44069621352043253, -1.0619557588626698, 0...
TP73 p53 [[0.08610000000000001, -0.6413, 0.738200000000... [[0.3150683151, 0.1095891096, 0.4383564384, 0.... 6.9394 10.2036 12.6079 14.4786 15.8512 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.6383... [[-0.19795499653149584, -0.8135600682680643, 0... [[0.02101255270508004, -0.5945925190314884, 0....

1080 rows × 11 columns

 
# and then ultimately we want an AUROC for both models -- basic and AChroMap
# do I want to do that myself in matplotlib, or use resgen or plotly?
# make multi-wigs
# script building views
# script going to a particular spot in the genome in a view
# make a way to plot
# - the ChIP datasets, DGF
# - the binding energy tracks
# - The imputed TF concentration

# I need to be able to use my tool to make predictions about binding

# for that I need nucleosome energies and MCMC up and running

# Do I want a bed file or a bigwig?
# make some code that turns something like the above into a bigWig and uploads it to ResGen

# Are there TFs from the full list of human DNA-binding TFs which we can chip for which we don’t have motifs for?

Probably no longer relevant:

Let's now proceed to a slightly more complex regulatory region, with 20 binding sites for 10 TFs

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-02-04T11:37:39.954187 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/

Notice first that when binding sites overlap slightly, we've introduced a free energy penalty of competition, denoted by a pink line between the binding sites, and when they stronly overlap, we've disallowed those configurations, denoted by a red line.

We notice a problem with evaluating our $p_\mathrm{config}$ expression with realistic regulatory DNA: the number of configurations grows exponentially in the number of transcription factor binding sites ($2^{|\mathrm{TFBS}|}$). Only 20 Transcription Factor Binding sites entails ~1 million configurations, and so ~1 million terms in the denominator $Z$. Computing probabilities of configurations exactly then becomes intractable for realistic scenarios: we need to proceed by sampling.

Thankfully there's a dynamic programming approach to computing Z.

config = np.round(np.random.rand(len(TFBSs))).astype(int)

Which we can plot, and compute the associated log-statistical-weight:

axs = draw_config(config)
plt.tight_layout()
print('log(p_config) =', log_P_config(config))
log(p_config) = -1.4748275325663975
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-02-03T18:15:21.883707 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/

Multiple TFs: Nucleosome-mediated Cooperativity

nucleosomes as a soft AND gate

Direct cooperativity (driven by protein-protein interactions) between transcription factors exerts significant control over transcription when it occurs, but it appears to be the exception, not the rule. More common is indirect cooperativity mediated by nucleosomes. Nucleosomes can be thought of as repressive transcription factors with large footprints and relatively non-specific sequence-dependent binding affinities. When multiple transcription factors bind independently in the footprint of a nucleosome, they do so cooperatively, despite not making physical contact with one another.

Including nucleosomes in our description of chromatin produces a new set of interactions, which require energy parameters. Those are:

  • Nucleosome:DNA interactions
  • Nucleosome:TF interactions
  • Nucleosome:Nucleosome interactions

Nucleosome:DNA

Nucleosomes have at least a 5000-fold range of affinities for differing DNA sequences.

What controls nucleosome positions? Nucleosomes do not bind specific 147-bp sequences, but they do have preferences for certain periodic patterns, and disfavor certain other patterns. Nucleosomes are canonically wrapped by 147bp of DNA, but due to nucleosome crowding and nucleosome beathing, they are often wrapped by fewer bases.

For our model, we will say that any stretch of DNA [of lengths 100-147bp] can be wrapped by a nucleosome, with an energy which is derived from the [sequence model here].

def nucleosome_energy_from_sequence(sequence):
    '''
    tells you the binding energy
    '''
    pass
 

Nucleosome:TF

Nucleosomes and Transcription Factors were considered to sterically block one another from binding the same stretch of DNA, however, we now know that many TFs can co-bind DNA and stabilize nucleosomal binding to a piece of DNA.

For simplicity, we'll assume we're only considering transcription factors which cannot co-bind DNA with nucleosomes, and impose a large ΔG penalty for nucleosome and TF co-binding.

Nucleosome:Nucleosome

Nucleosomes are spaced by stretches of free DNA called linker DNA, which have characterisitic lengths. Those specific lengths are believed to facilitate nucleosome arrays forming higher-order chromatin structures.

def energy_of_spacing_between_nucleosomes(nucleosome_positions):
    '''
    '''
    pass
nanomolar_kd_from_kcal_ΔG(23.8*kbT)
0.04610959744808222
nanomolar_kd_from_kcal_ΔG(14.4*kbT)
557.3903692694596
# Resgen: chip, dgf, and atac
# chip: bigwig, bed
# dgf: bed
# atac: bigwig, bed

send blog post for feedback:

  • Jeremy / Leonid
  • Advait / Bin Zhang
  • Anders
  • Kellis lab
  • Vlad Teif, other online nucleosome people
  • TF people?
  • Muir Morrison / Rob Philips
  • Bruce Tidor?

transfer matrix: probably won't use

# the PWMs are concatenated head to tail on the y/i axis (and expanded ACGT). The next base is on the x/j axis: there are 4 values in the row.

# the start of each motif is that value * concentration

# I need hypothetical absolute max kd's ΔG's. Then everything is relative to that. Does the transfer matrix sum energies or something else?

# how does the particular sequence enter the matrix multiplication? You're choosing a path through the matrix.
# Does that mean the matrix at each base is different? Or there are 4 matrices again?
# yeah I think there have to be 4 matrices (or 6 if you include methylation). Wait since matrices are for each pair of positions, we need 16 acctually?
sum(tfdb.log_pwm.apply(len))
12757
transfer_matrix_a = np.zeros((len(tf_states), len(tf_states)))
# transfer_matrix_c = np.zeros((len(tf_states), len(tf_states)))
# transfer_matrix_g = np.zeros((len(tf_states), len(tf_states)))
# transfer_matrix_t = np.zeros((len(tf_states), len(tf_states)))
tfdb.log_pwm
TFAP2A    [[-0.5324, 0.34450000000000003, -1.5585, 0.530...
TFAP2B    [[-2.4892, 0.5962000000000001, 1.0115, -2.6127...
TFAP2C    [[-0.5975, -0.163, -1.1323, 0.7243], [-1.3232,...
TFAP2D    [[0.8455, -1.4861, 0.1063, -1.8089], [-5.6699,...
TFAP2E    [[-5.6699, 0.43660000000000004, 1.1706, -5.669...
                                ...                        
BATF3     [[-0.8147000000000001, 0.11900000000000001, 0....
CREB1     [[-0.049100000000000005, -0.20500000000000002,...
TP53      [[0.1729, -0.6791, 0.6519, -0.661], [0.7944, -...
TP63      [[-0.5391, -1.3228, 1.0218, -0.394300000000000...
TP73      [[0.08610000000000001, -0.6413, 0.738200000000...
Name: log_pwm, Length: 1080, dtype: object
base_index = {'a':0,'c':1,'g':2,'t':3}
tf_states = pd.Index([(tf, i) for tf, log_pwm in tfdb.log_pwm.items() for i in range(len(log_pwm))])
def transfer_matrix_index(TF, offset): return tf_states.get_loc((TF, offset))
for TF, pwm in tfdb.log_pwm.items():
    for offset, row in enumerate(pwm):

        i = transfer_matrix_index(TF, offset)

        transfer_matrix_a[i][i+1] = pwm[offset][base_index['a']]
----------------------------------------------------------
IndexError               Traceback (most recent call last)
<ipython-input-231-ffe1fe50396a> in <module>
      4         i = transfer_matrix_index(TF, offset)
      5 
----> 6         transfer_matrix_a[i][i+1] = pwm[offset][base_index['a']]
      7 

IndexError: index 12757 is out of bounds for axis 0 with size 12757

Verify against proteomics

df = pd.read_csv('https://gygi.hms.harvard.edu/data/ccle/protein_quant_current_normalized.csv.gz')

df = df[['Protein_Id',
         'Gene_Symbol',
         'Description',
         'Group_ID',
         'Uniprot',
         'Uniprot_Acc',
         'K562_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE_TenPx25']].rename(columns={'K562_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE_TenPx25':'Gygi_K562'})
updates, missing, names_updated = namespace_mapping(df.Gene_Symbol)
passed 12755 symbols
31 names contained whitespace. Stripping...
558 duplicates. 12197 uniques.

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-12197...done.
Finished.
728 input query terms found dup hits:
	[('ARX', 2), ('NCL', 2), ('CNP', 2), ('RPF1', 2), ('HYPK', 2), ('FH', 3), ('RPL17', 3), ('ATF2', 2),
13 input query terms found no hit:
	['FLJ22184', 'L1RE1', 'MT-ND5', 'MT-ND4', 'MT-CO1', 'MT-CYB', 'MT-ND1', 'GAGE3', 'MT-ATP8', 'hCG_198

unchanged: 11537;  updates: 647;  missing: 13
df.Gene_Symbol = names_updated
k562_TF_proteomics = df.groupby('Gene_Symbol')['Gygi_K562'].max().reindex(tfdb.index)
sum(k562_TF_proteomics.notnull()) / len(k562_TF_proteomics)
0.387037037037037
k562_TF_proteomics
TFAP2A   -2.802736
TFAP2B   -2.548439
TFAP2C   -3.410715
TFAP2D         NaN
TFAP2E         NaN
            ...   
BATF3          NaN
CREB1     0.484013
TP53     -3.214319
TP63     -0.861330
TP73           NaN
Name: Gygi_K562, Length: 1080, dtype: float64
tf_proteomics_validation = pd.concat((K562_TF_nanomolar_conc.rename('[TF] predicted from ENCODE RNA'), k562_TF_proteomics), axis=1)
tf_proteomics_validation
[TF] predicted from ENCODE RNA Gygi_K562
TFAP2A 0.519441 -2.802736
TFAP2B 0.121646 -2.548439
TFAP2C 0.155407 -3.410715
TFAP2D 0.000000 NaN
TFAP2E 0.187127 NaN
... ... ...
BATF3 3.407311 NaN
CREB1 80.409642 0.484013
TP53 14.273234 -3.214319
TP63 4.264729 -0.861330
TP73 3.747720 NaN

1080 rows × 2 columns

supp3 = pd.read_excel('~/Desktop/geiger_supp/mcp.M111.014050-3.xlsx', skiprows=[0])
supp3 = supp3[['Protein IDs', 'Protein Names', 'Gene Names', 'Uniprot', 'Peptides','Razor + unique Peptides', 'Unique Peptides', 'Sequence Coverage [%]','Mol. Weight [kDa]', 'PEP', 'iBAQ K562_1','iBAQ K562_2', 'iBAQ K562_3','LFQ Intensity K562_1', 'LFQ Intensity K562_2', 'LFQ Intensity K562_3',]]
supp3['Gene Names'] = supp3['Gene Names'].str.split(';')
sum(supp3['Gene Names'].isna()) / len(supp3)
0.060758414997869624
supp3_genes = [x for l in supp3['Gene Names'].values if str(l) != 'nan' for x in l]
updates, missing, names_updated = namespace_mapping(supp3_genes)
passed 30872 symbols
2285 duplicates. 28587 uniques.

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-24000...done.
querying 24001-25000...done.
querying 25001-26000...done.
querying 26001-27000...done.
querying 27001-28000...done.
querying 28001-28587...done.
Finished.
1899 input query terms found dup hits:
	[('GJC1', 2), ('ATP6C', 2), ('A15', 2), ('MYPOP', 2), ('LAG1', 4), ('LASS1', 2), ('UOG1', 2), ('SDP1
8595 input query terms found no hit:
	['FAM1B', 'DKFZp686P0738', 'hCG_23354', 'KIAA0693', 'HSPC198', 'BM-004', 'NKD', 'PP7246', 'hCG_18772

unchanged: 9507;  updates: 10576;  missing: 8595
supp3['gene_name'] = [[] for _ in range(len(supp3))]

for i, names in supp3['Gene Names'].items():
    if str(names) != 'nan':
        for name in names:
            if name in updates:
                supp3.iloc[i].gene_name.append(updates[name])
            elif name not in missing:
                supp3.iloc[i].gene_name.append(name)
supp3['ambiguous'] = False

for i, names in supp3['gene_name'].items():
    if len(names) == 0:
        supp3.loc[i, 'gene_name'] = np.nan
    elif all(name == names[0] for name in names):
        supp3.loc[i, 'gene_name'] = names[0]
    else:
        supp3.loc[i, 'ambiguous'] = True
len(supp3[supp3.ambiguous]) / len(supp3)
0.11231359181934385
Mann_K562 = supp3[supp3.gene_name.isin(tfdb.index)].set_index('gene_name').drop('Gene Names', axis=1)
Mann_K562 = Mann_K562[['LFQ Intensity K562_1','LFQ Intensity K562_2','LFQ Intensity K562_3']].rename(columns={'LFQ Intensity K562_1':'Mann_K562_1','LFQ Intensity K562_2':'Mann_K562_2','LFQ Intensity K562_3':'Mann_K562_3'})

Mann_K562 = Mann_K562.reset_index().iloc[Mann_K562.notnull().sum(axis=1).reset_index().groupby('gene_name')[0].idxmax().values].set_index('gene_name').reindex(tfdb.index)
Mann_K562 = Mann_K562.median(axis=1)
Mann_K562
TFAP2A   NaN
TFAP2B   NaN
TFAP2C   NaN
TFAP2D   NaN
TFAP2E   NaN
          ..
BATF3    NaN
CREB1    NaN
TP53     NaN
TP63     NaN
TP73     NaN
Length: 1080, dtype: float64
tf_proteomics_validation = pd.concat((tf_proteomics_validation, Mann_K562.rename('Mann_K562')), axis=1)
tf_proteomics_validation.plot.scatter(x='[TF] predicted from ENCODE RNA', y='Gygi_K562', logx=True)
<AxesSubplot:xlabel='[TF] predicted from ENCODE RNA', ylabel='Gygi_K562'>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-02-24T18:38:31.021492 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
tf_proteomics_validation.plot.scatter(x='[TF] predicted from ENCODE RNA', y='Mann_K562', logx=True)
<AxesSubplot:xlabel='[TF] predicted from ENCODE RNA', ylabel='Mann_K562'>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-02-24T18:39:25.544396 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/
tf_proteomics_validation.plot.scatter(x='Gygi_K562', y='Mann_K562')
<AxesSubplot:xlabel='Gygi_K562', ylabel='Mann_K562'>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> 2022-02-24T18:39:23.630016 image/svg+xml Matplotlib v3.5.1, https://matplotlib.org/